Intermediate 10 min · 2026-06-21

Ansible Error Handling: Production Patterns from a 3AM PagerDuty

Master Ansible error handling with ignore_errors, failed_when, block/rescue/always, any_errors_fatal, and max_fail_percentage.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Notes here come from systems that actually shipped.

June 21, 2026
last updated
Quick Answer

  • Use ignore_errors: yes only when you truly don't care about a task's exit status; the task still shows as failed in the output, followed by '...ignoring'.
  • Use failed_when to define custom failure conditions (e.g., failed_when: result.rc != 0 or 'ERROR' in result.stderr).
  • Use changed_when to prevent false 'changed' status (e.g., changed_when: "'updated' in result.stdout" for a script that runs on every play but only sometimes changes anything).
  • Wrap critical sequences in block/rescue/always for try-catch-finally behavior; rescue runs on failure, always runs regardless.
  • Set any_errors_fatal: true on a play so that a failure on any host aborts the play for all hosts once the current task has finished.
  • Use max_fail_percentage in rolling updates to abort if more than N% of the hosts in a batch fail; e.g., serial: 5, max_fail_percentage: 20.
  • In rescue blocks, use ansible.builtin.include_tasks to run cleanup or notification tasks.
  • Always test error handling paths in CI; a misconfigured failed_when can silently swallow real failures.

What is Ansible Error Handling?

Ansible error handling refers to the mechanisms that control how Ansible responds when a task fails or returns an unexpected status. By default, when a task fails on a host, Ansible stops running further tasks on that host and continues on the remaining hosts; any_errors_fatal, which would abort the play for every host, is off by default.

However, production playbooks need fine-grained control: you might want to ignore certain failures, define custom failure conditions based on command output, or override the 'changed' status. The core directives are ignore_errors, failed_when, changed_when, and the block/rescue/always pattern.

Additionally, any_errors_fatal and max_fail_percentage control play-level failure propagation. These tools allow you to build resilient automation that handles edge cases gracefully, without masking real issues.

Plain-English First

Imagine you're a chef cooking a complex multi-course meal. Your recipe (the Ansible playbook) has steps like 'chop onions' and 'sear steak.' If you burn the onions (a task fails), you have a few options: you can ignore it and move on (ignore_errors), or you can decide that burnt onions are actually a failure only if they're black (failed_when). You might also want to know if the steak is 'changed' only when it's actually cooked differently (changed_when). For risky sequences, like reducing a sauce, you might use a 'try-catch' approach: try the reduction, if it fails, rescue by adding a thickener, and always clean up the pan (block/rescue/always). In a busy kitchen, if one station fails, you might want to stop the whole service (any_errors_fatal) or only if too many stations fail (max_fail_percentage). This article teaches you these patterns so your automated kitchen runs smoothly.

It was 3 AM, and my phone was buzzing with PagerDuty alerts. Our Ansible-driven deployment had taken down half the production fleet. The playbook had a task that checked for a lock file; if it existed, the task failed, and Ansible stopped the entire play. The problem? A stale lock file from a previous deployment that should have been ignored. We had ignore_errors set, but a junior engineer had commented it out during a code review, thinking it was dead code. The result: every host that had that lock file failed, and our rolling update aborted after the first batch. We lost 30 minutes of uptime. That night, I learned that error handling in Ansible isn't just about preventing failures; it's about defining what failure means for your system.

ignore_errors: When to Use and When to Avoid

The ignore_errors directive tells Ansible to continue executing tasks on a host even if the current task fails. It's a blunt instrument. Use it for non-critical checks, like verifying a service is running where you have a fallback. Never use it to hide real failures: it masks the error from the rest of the play while still showing the task as failed in the output (with '...ignoring'). Setting failed_when: false hides the failure completely, which is worse, so treat that as an anti-pattern. Consider checking for a lock file before deployment:

```yaml
- name: Check for deployment lock file
  ansible.builtin.stat:
    path: /var/lock/deploy.lock
  register: lock_check
  # No ignore_errors needed: stat does not fail on a missing file,
  # it returns lock_check.stat.exists == false.
```

Because stat reports a missing file as exists: false instead of failing, ignore_errors would be redundant on this task; branch on lock_check.stat.exists in a later when instead. The point: only use ignore_errors when the task's failure is acceptable and you have subsequent logic to handle it. In production, we once had a task that stopped a service that might already be stopped; we used ignore_errors. But then a real failure (e.g., service not found) was ignored, causing a cascading issue. We switched to failed_when: result.rc != 0 and 'not running' not in result.stderr.

ignore_errors does not suppress error output
The task will still show as 'failed' in output, just with 'ignoring' appended. It can confuse operators. Use sparingly.
Production Insight
We had a task that checked if a mount point existed using mount command. It failed when the mount wasn't present. We used ignore_errors, but then the mount actually failed due to a bad filesystem, and we didn't notice. We changed to register the output and use failed_when.
Key Takeaway
Use ignore_errors only for truly optional checks; prefer failed_when for granular control.
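
The service-stop fix described above could look roughly like this. It is a minimal sketch, not the original playbook: the control script path is hypothetical, and it assumes the script prints 'not running' to stderr when the service is already stopped, so check what your own tooling prints.

```yaml
- name: Stop the service, tolerating "already stopped"
  ansible.builtin.command: /usr/local/bin/myapp-ctl stop  # hypothetical control script
  register: stop_result
  # A list under failed_when is ANDed: fail only on a non-zero exit
  # that is NOT the benign "not running" case.
  failed_when:
    - stop_result.rc != 0
    - "'not running' not in stop_result.stderr"
  # Assumption: exit code 0 means the script actually stopped something.
  changed_when: stop_result.rc == 0
```

Unlike ignore_errors, this still fails loudly on anything unexpected, such as a missing service.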

failed_when: Defining Custom Failure Conditions

failed_when overrides Ansible's default failure detection. You provide a Jinja2 expression that evaluates to true when the task should be considered failed. This is essential for commands that return non-zero on success (e.g., grep returning 1 for no match) or for complex checks based on stdout/stderr. Syntax:

```yaml
- name: Run custom script
  ansible.builtin.shell: /usr/local/bin/check_health.sh
  register: health_result
  failed_when: health_result.rc != 0 or 'CRITICAL' in health_result.stdout
```

Common gotcha: failed_when is evaluated after the task runs, against its result. If the task never produces a result (for example, an undefined variable in its arguments or an unreachable host), failed_when is not evaluated. Also, failed_when and ignore_errors interact: failed_when decides whether the task counts as failed, and ignore_errors then decides whether that failure stops the host. With both set, a task whose failed_when is true is still reported as failed, then ignored. To make a task never fail, set failed_when: false, but reserve that for cases where no outcome matters. Production tip: always test your failed_when condition with a known failure case. We once had failed_when: result.rc == 1 but the command returned 2 for a different error; we missed a failure.

failed_when with multiple conditions
Use parentheses and and/or to combine conditions. Example: failed_when: (result.rc != 0) or ('ERROR' in result.stderr).
Production Insight
During a database migration, a script returned exit code 0 but printed 'FAIL' to stdout. Our failed_when caught it with 'FAIL' in result.stdout. Saved us from a corrupted database.
Key Takeaway
failed_when is your scalpel for defining exactly what constitutes a failure; always test both success and failure paths.
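
The grep case mentioned above is a good illustration of exit codes that do not mean failure. A minimal sketch, assuming a hypothetical config path and setting name:

```yaml
- name: Look for a deprecated setting
  ansible.builtin.command: grep -q 'legacy_mode' /etc/myapp/app.conf  # hypothetical path and setting
  register: grep_result
  # grep exits 0 on a match, 1 on no match, and greater than 1 on a real error
  failed_when: grep_result.rc > 1
  changed_when: false  # read-only check

- name: Warn when the setting is still present
  ansible.builtin.debug:
    msg: "legacy_mode is still set on {{ inventory_hostname }}"
  when: grep_result.rc == 0
```

Writing the condition as rc > 1 rather than rc == 2 avoids the trap described earlier, where an unexpected exit code slipped past an equality check.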

changed_when: Preventing False Changes

changed_when controls whether a task reports 'changed' or 'ok'. By default, a module reports 'changed' when it modifies state (e.g., the file module). The command and shell modules cannot know what a command did, so they report 'changed' every time they run. You can override that with a condition on the registered result:

```yaml
- name: Run idempotent script
  ansible.builtin.shell: /usr/local/bin/update_cache.sh
  register: cache_update
  changed_when: cache_update.rc == 0 and 'updated' in cache_update.stdout
```

If you want a task to never report changed, use changed_when: false. This is common for read-only checks. However, be careful: if a task that should change things never reports changed, you lose audit trail. In production, we had a task that restarted a service only if a config file changed; we used changed_when: config_changed where config_changed was a registered variable. This gave accurate change tracking.

changed_when and handlers
Handlers are notified only if a task reports 'changed'. If you override changed_when to false, handlers won't fire. When a handler depends on the task, give changed_when a real condition instead of false.
Production Insight
We had a task that ran a script to sync users. The script always returned 'changed' because it logged something. We added changed_when: false and then used a separate task to detect actual changes. This reduced noise in our deployment logs.
Key Takeaway
Use changed_when to align Ansible's change detection with your actual state changes; avoid false positives that trigger unnecessary handlers.
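
To show how changed_when and handlers fit together, here is a minimal sketch. It assumes the cache script from the earlier example prints 'updated' only when it changed something, and the handler and service names are hypothetical.

```yaml
- name: Refresh cache and reload only on real changes
  hosts: webservers
  tasks:
    - name: Refresh cache
      ansible.builtin.command: /usr/local/bin/update_cache.sh
      register: cache_update
      # Assumption: the script prints "updated" only when it changed something.
      changed_when: "'updated' in cache_update.stdout"
      notify: Reload app

  handlers:
    - name: Reload app
      ansible.builtin.service:
        name: myapp  # hypothetical service name
        state: reloaded
```

The handler fires only on runs where the condition is true, so a no-op run neither reloads the service nor clutters the change log.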

block/rescue/always: The Try-Catch-Finally of Ansible

The block/rescue/always pattern provides structured error handling for a group of tasks. block contains the main tasks. If any task in the block fails, the rescue block executes. The always block runs regardless of success or failure. This is perfect for cleanup operations:

```yaml
- name: Deploy application
  block:
    # The Docker modules live in the community.docker collection,
    # not in ansible.builtin.
    - name: Pull latest image
      community.docker.docker_image:
        name: myapp:latest
        source: pull
    - name: Start container
      community.docker.docker_container:
        name: myapp
        image: myapp:latest
        state: started
  rescue:
    - name: Rollback to previous image
      community.docker.docker_image:
        name: myapp:previous
        source: pull
    - name: Notify team
      ansible.builtin.uri:
        url: https://hooks.slack.com/services/...
        method: POST
        body_format: json
        body:
          text: "Deployment failed, rolled back"
  always:
    - name: Clean up temp files
      ansible.builtin.file:
        path: /tmp/deploy_temp
        state: absent
```

Important: variables set in block are available in rescue and always. However, if a task in rescue fails, the host fails (unless you handle it). Use ignore_errors in rescue if needed. Also, rescue runs only after a task in the block returns a failed state; a bad task definition (such as a syntax error) or an unreachable host does not trigger it.

Rescue does not catch all failures
Unreachable hosts and bad task definitions (such as syntax errors) are not caught by rescue. Only tasks that run and return a failed state are caught.
Production Insight
We used block/rescue to wrap a multi-step database migration. When a step failed, rescue rolled back the schema and notified the team. The always block cleaned up temporary SQL files. This pattern saved us from manual intervention multiple times.
Key Takeaway
Use block/rescue/always for atomic operations that need cleanup or rollback; it's the closest thing to try-catch-finally in Ansible.
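
Inside rescue, Ansible exposes ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. A minimal sketch of using them for diagnostics; the migration script path is hypothetical:

```yaml
- name: Run migration with diagnostics on failure
  block:
    - name: Apply schema migration
      ansible.builtin.command: /usr/local/bin/migrate.sh  # hypothetical script
  rescue:
    - name: Show what failed
      ansible.builtin.debug:
        msg: >-
          Task '{{ ansible_failed_task.name }}' failed:
          {{ ansible_failed_result.msg | default('no details') }}
```

Including the failed task's name and message in a notification saves the on-call engineer from digging through the full run log.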

any_errors_fatal: Stop the Play on Any Failure

By default, if a task fails on a host, Ansible stops executing further tasks on that host but continues on other hosts. Setting any_errors_fatal: true changes this: if any task fails on any host, Ansible finishes that task on the other hosts in the current batch and then stops the play for all hosts. This is useful when a failure on one host indicates a systemic issue that should halt the entire deployment. Use it sparingly, as it can leave a rollout half-finished.

```yaml
- name: Deploy critical update
  hosts: all
  any_errors_fatal: true
  tasks:
    - name: Validate config
      ansible.builtin.shell: /usr/local/bin/validate_config.sh
```

In production, we used this for a security patch that had to be applied consistently across all hosts. If one host failed validation, we wanted to stop and investigate. However, we combined it with serial: 1 to limit blast radius. A common mistake is setting any_errors_fatal: true without serial, causing all hosts to fail if one has a transient issue.

Combine with serial for controlled rollout
Use serial: 1 or a small batch size with any_errors_fatal: true to avoid taking down the entire fleet on a single failure.
Production Insight
We had a playbook that deployed a new SSL certificate. One host had a misconfigured nginx, causing the task to fail. With any_errors_fatal, the entire deployment stopped, preventing the bad config from spreading. We then fixed the host and re-ran.
Key Takeaway
any_errors_fatal is a nuclear option; use it only when a failure on one host means the entire deployment is compromised.
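
Combining any_errors_fatal with serial, as described above, might look like the following sketch. The patch script path is hypothetical; the validation script is the one from the earlier example.

```yaml
- name: Apply security patch one host at a time
  hosts: all
  serial: 1               # one host per batch limits the blast radius
  any_errors_fatal: true  # the first failure stops the whole rollout
  tasks:
    - name: Validate config before patching
      ansible.builtin.command: /usr/local/bin/validate_config.sh
      changed_when: false

    - name: Apply patch
      ansible.builtin.command: /usr/local/bin/apply_patch.sh  # hypothetical script
```

With serial: 1, a failure on the first host means later batches never start, so the remaining hosts are left untouched.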

max_fail_percentage: Graceful Degradation in Rolling Updates

max_fail_percentage is a play-level directive that sets the maximum percentage of hosts that can fail before Ansible aborts the entire play. It's typically used with serial for rolling updates. For example:

```yaml
- name: Rolling update
  hosts: webservers
  serial: 5
  max_fail_percentage: 20
  tasks:
    - name: Update app
      ansible.builtin.yum:
        name: myapp
        state: latest
```

If more than 20% of the hosts in a batch fail, the play stops. This prevents a bad deployment from taking down too many hosts. With serial, the percentage is calculated per batch, not globally, and it must be exceeded rather than merely reached: with 5 hosts per batch, 1 failure (20%) is tolerated, while 2 failures (40%) exceed 20% and stop the play. If max_fail_percentage is not set, no percentage threshold applies and the play keeps going as long as some hosts in the batch have not failed. Setting it to 0 aborts on any failure. To allow some failures, set a positive integer. In production, we use 20% for rolling updates to tolerate transient issues.

max_fail_percentage vs any_errors_fatal
any_errors_fatal stops the play on any failure regardless of percentage. max_fail_percentage allows a certain percentage before aborting. They can be combined; any_errors_fatal overrides.
Production Insight
During a rolling update of 100 web servers with serial: 10, we had a batch where 3 hosts failed due to a transient network issue. With max_fail_percentage: 20, the play aborted because 3/10 = 30% exceeds 20%. We raised the threshold to 30, which tolerates 3 failures out of 10 because the threshold has to be exceeded, not just reached.
Key Takeaway
Set max_fail_percentage based on your tolerance for failure; remember it's per-batch, not global.
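
The per-batch arithmetic is easy to get wrong, so here is the same rule worked through as comments. This is an illustrative sketch using the generic package module and a hypothetical package name:

```yaml
- name: Rolling update
  hosts: webservers
  serial: 10
  # The threshold must be exceeded, not merely reached:
  #   2 of 10 failed -> 20%, not greater than 20 -> play continues
  #   3 of 10 failed -> 30%, greater than 20     -> play aborts
  max_fail_percentage: 20
  tasks:
    - name: Update app
      ansible.builtin.package:
        name: myapp  # hypothetical package name
        state: latest
```

To abort when exactly half of a batch fails, set the value just below 50 (for example 49) rather than 50.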

Using Rescue to Notify and Clean Up After Failures

The rescue block is not just for rollback; it's also for notification and cleanup. You can use ansible.builtin.uri to call webhooks, community.general.mail to send emails, or community.general.slack to notify teams. For cleanup, use ansible.builtin.file to remove temporary files, or ansible.builtin.service to stop services. Example:

```yaml
- name: Deploy with notification
  block:
    - name: Deploy app
      ansible.builtin.copy:
        src: /tmp/app.war
        dest: /opt/tomcat/webapps/
    - name: Restart tomcat
      ansible.builtin.service:
        name: tomcat
        state: restarted
  rescue:
    - name: Notify failure
      ansible.builtin.uri:
        url: "https://hooks.slack.com/services/T00/B00/xxx"
        method: POST
        body_format: json
        body:
          text: "Deployment failed on {{ inventory_hostname }}"
      ignore_errors: yes
    - name: Clean up deployed file
      ansible.builtin.file:
        path: /opt/tomcat/webapps/app.war
        state: absent
      ignore_errors: yes
  always:
    - name: Remove temp files
      ansible.builtin.file:
        path: /tmp/deploy_temp
        state: absent
```

Note the `ignore_errors: yes` on rescue tasks: if the notification fails, you don't want that to compound the failure. Also, the always block runs even if rescue fails. This pattern is essential for maintaining observability and cleanliness in production.

Use ignore_errors in rescue tasks
If a rescue task fails, the play will fail, potentially masking the original error. Add ignore_errors: yes to non-critical rescue tasks like notifications.
Production Insight
We had a deployment that failed because of a missing dependency. The rescue block not only rolled back the deployment but also sent a message to our incident channel with the exact error. This allowed the on-call engineer to quickly diagnose and fix.
Key Takeaway
Rescue blocks should both remediate and notify; always include ignore_errors on notification tasks to avoid secondary failures.
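
One thing to keep in mind: when every task in rescue succeeds, Ansible clears the host's failed state and carries on as if the block had succeeded. If the host should still count as failed after cleanup, re-raise explicitly with ansible.builtin.fail. A minimal sketch, reusing the paths from the example above:

```yaml
- name: Deploy with rollback, then re-raise
  block:
    - name: Deploy app
      ansible.builtin.copy:
        src: /tmp/app.war
        dest: /opt/tomcat/webapps/
  rescue:
    - name: Remove the partially deployed file
      ansible.builtin.file:
        path: /opt/tomcat/webapps/app.war
        state: absent

    - name: Re-raise so the host still counts as failed
      ansible.builtin.fail:
        msg: "Deploy failed on {{ inventory_hostname }}: {{ ansible_failed_result.msg | default('see task output') }}"
```

Whether to re-raise is a design choice: leave it out when the rescue fully repairs the host, add it when later tasks must not run on a host that was rolled back.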

Combining Error Handling Directives: A Production Pattern

In real playbooks, you'll combine multiple directives. Here's a pattern for a rolling update with error handling:

```yaml
- name: Rolling update with error handling
  hosts: webservers
  serial: 10
  max_fail_percentage: 20
  any_errors_fatal: false
  tasks:
    - name: Pre-check
      block:
        - name: Check disk space
          ansible.builtin.shell: df / | awk 'NR==2 {print $5}' | sed 's/%//'
          register: disk_usage
          changed_when: false
          failed_when: disk_usage.stdout | int > 90
        - name: Check service health
          ansible.builtin.uri:
            url: http://localhost:80/health
            status_code: 200
          register: health
          ignore_errors: yes  # advisory only: a failure here does not trigger rescue
      rescue:
        - name: Report the skipped host
          ansible.builtin.debug:
            msg: "Host {{ inventory_hostname }} failed pre-check, skipping"
        - name: Notify
          ansible.builtin.uri:
            url: https://hooks.slack.com/...
            method: POST
            body_format: json
            body:
              text: "Pre-check failed on {{ inventory_hostname }}"
          ignore_errors: yes
        - name: Re-fail so the host is skipped and counted
          ansible.builtin.fail:
            msg: "Pre-check failed on {{ inventory_hostname }}"
      always:
        - name: Log check result
          ansible.builtin.copy:
            content: "{{ disk_usage.stdout | default('') }}"
            dest: /var/log/precheck.log
          ignore_errors: yes
```

This pattern checks prerequisites and logs the result. Because a rescue that completes successfully clears the host's failed state, the rescue ends with ansible.builtin.fail: that is what actually skips the host for the rest of the play and makes it count toward max_fail_percentage. The play continues with other hosts, but if too many fail, max_fail_percentage aborts. This is a robust pattern for large fleets.

Order of precedence
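failed_when decides whether a task counts as failed; ignore_errors then decides whether that failure stops the host. rescue runs only if a task in block fails, and a rescue that succeeds clears the failure. always runs regardless. any_errors_fatal overrides max_fail_percentage.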
Production Insight
We used this pattern to deploy a new version of our API server. The pre-check verified that the database was reachable. If not, the host was skipped and we got a notification. This prevented a full rollout to a broken state.
Key Takeaway
Combine block/rescue/always with play-level directives for a comprehensive error handling strategy that scales.

Testing Error Handling: CI/CD Patterns

Error handling code is only as good as its test coverage. In CI, create test playbooks that intentionally fail to verify your error paths. Use ansible-playbook --syntax-check to catch syntax errors. For logic testing, use ansible-playbook --check --diff to see what would change. But for error handling, you need to actually trigger failures. We use molecule with scenarios that simulate failures:

```yaml
# molecule/default/verify.yml
- name: Verify error handling
  hosts: all
  tasks:
    - name: Trigger failure
      ansible.builtin.command: /bin/false
      register: result
      failed_when: result.rc != 0
```

Then assert that the rescue block ran. Another pattern: use ansible.builtin.fail module in test plays. For example, to test max_fail_percentage, run a playbook with multiple hosts and force failures on some. Use ansible-playbook --limit to target specific hosts. Also, use -v flags to see error handling output: -vvv shows failed_when evaluation. In production, we have a CI pipeline that runs a dedicated 'chaos' playbook that injects failures to validate our error handling.

Don't test error handling in production
Always test in a staging environment that mirrors production. Use dedicated test hosts or containers.
Production Insight
Our CI pipeline once passed despite a misconfigured failed_when because the test never triggered the failure condition. We added a step that explicitly forces the failure condition to validate the error path.
Key Takeaway
Intentionally trigger failures in CI to validate your error handling; don't assume it works because the happy path passes.
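
A small self-checking play makes the "assert that rescue ran" step concrete. This is a sketch under simple assumptions: it runs against localhost, forces the failure with ansible.builtin.fail, and uses a hypothetical fact name (rescue_ran) as the marker.

```yaml
- name: Verify that rescue runs when the block fails
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Exercise the error path
      block:
        - name: Force a failure
          ansible.builtin.fail:
            msg: "intentional failure for testing"
      rescue:
        - name: Record that rescue ran
          ansible.builtin.set_fact:
            rescue_ran: true  # hypothetical marker fact

    - name: Assert the rescue path executed
      ansible.builtin.assert:
        that:
          - rescue_ran | default(false)
        fail_msg: "rescue did not run"
```

If someone later removes or breaks the rescue section, the forced failure is no longer caught and the play fails, which makes the CI job fail as well.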

Common Pitfalls with ignore_errors and failed_when Interactions

The interaction between ignore_errors and failed_when can be confusing. Key rule: ignore_errors is evaluated after failed_when. So if both are set, the task is first evaluated for failure using failed_when. If failed_when returns true, the task is marked failed, but then ignore_errors causes the play to continue. The task output still shows 'failed' with 'ignoring'. This can mislead operators. A common pitfall is setting ignore_errors: yes on a task with failed_when thinking it will suppress the failure display. It doesn't. To truly suppress, use failed_when: false and no ignore_errors. But that's an anti-pattern. Better: use register and conditionals on subsequent tasks. Example:

```yaml
- name: Attempt to stop service
  ansible.builtin.service:
    name: myapp
    state: stopped
  register: stop_result
  ignore_errors: yes

- name: Handle failure
  ansible.builtin.debug:
    msg: "Service stop failed, continuing"
  when: stop_result is failed
```

This pattern is clearer than relying on ignore_errors alone. In production, we avoid ignore_errors on critical tasks; we use register and when to handle failures explicitly.

ignore_errors does not change the failed status
The task's 'failed' status remains true; ignore_errors only allows the play to continue. Use register and when for conditional logic.
Production Insight
We had a task that checked for a file using stat and used ignore_errors. The file didn't exist, so stat returned 'exists: false', but the task didn't fail. So ignore_errors was unnecessary. We removed it.
Key Takeaway
Prefer register + when over ignore_errors for conditional logic; use ignore_errors only when you truly want to ignore any failure and continue.

Error Handling in Loops: With_items and Failed Items

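When using loops (e.g., with_items, loop), a failed iteration does not stop the remaining iterations, but the task as a whole is marked failed once the loop finishes, and the host stops there. To keep going and handle per-item failures, use ignore_errors: yes on the task, register the results, and then check them for failures. Example: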

```yaml
- name: Install packages
  ansible.builtin.yum:
    name: "{{ item }}"
    state: present
  loop:
    - nginx
    - bad-package
    - mysql
  ignore_errors: yes
  register: install_results

- name: Report failed packages
  ansible.builtin.debug:
    msg: "Package {{ item.item }} failed to install"
  loop: "{{ install_results.results | selectattr('failed', 'equalto', true) | list }}"
```

This pattern allows the play to continue and then process failures. In production, we use this for package installations where some packages might be unavailable. We then send a report of failed packages to a monitoring system.

Use loop_control to limit failures
You can use loop_control with pause to throttle, but for error handling, register the results and filter.
Production Insight
During a mass package update, one package had a dependency conflict. With ignore_errors on the loop, the play continued and we captured the failure. We then fixed the dependency and re-ran only the failed packages.
Key Takeaway
For loops, use ignore_errors at the task level and then inspect results for per-item failures; this gives you granular control without aborting the entire loop.
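
Building on install_results from the example above, the failed items can be collected into a list and used to fail the host once the whole loop has been processed. A minimal sketch; failed_packages is a hypothetical fact name:

```yaml
- name: Collect the names of packages that failed
  ansible.builtin.set_fact:
    failed_packages: >-
      {{ install_results.results
         | selectattr('failed', 'equalto', true)
         | map(attribute='item')
         | list }}

- name: Fail at the end if anything was missed
  ansible.builtin.fail:
    msg: "Could not install: {{ failed_packages | join(', ') }}"
  when: failed_packages | length > 0
```

This keeps the benefit of attempting every item while making sure an ignored failure does not end in a green run.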

Error Handling Best Practices for Production Playbooks

  1. Always register results for tasks that can fail, even if you use ignore_errors. This allows debugging later.
  2. Use failed_when instead of ignore_errors when you have specific failure criteria.
  3. Limit any_errors_fatal to critical deployments; use max_fail_percentage for rolling updates.
  4. Test error paths in CI by forcing failures.
  5. Document error handling decisions in comments, especially why a task is ignored.
  6. Use block/rescue/always for any multi-step operation that needs cleanup.
  7. Avoid nested blocks; they complicate error handling.
  8. Set changed_when: false on read-only tasks to avoid false change notifications.
  9. Use ansible.builtin.fail in rescue blocks to re-raise failures after cleanup if needed.
  10. Monitor for ignored failures; they can hide real issues. Use a post-play hook to check for ignored tasks.

```yaml
- name: Fail the run if an ignored task actually failed
  ansible.builtin.fail:
    msg: "Ignored failure in a previous task: {{ stop_result.msg | default('no message') }}"
  when: stop_result is failed
```

This is a simplified example that checks a single registered result (stop_result, from the service-stop task shown earlier). The play recap also prints an ignored count per host; in practice, you'd need to aggregate across hosts.

Ansible lint can catch some issues
Run ansible-lint on your playbooks; for example, its ignore-errors rule flags unconditional ignore_errors, and its no-changed-when rule flags command and shell tasks that lack changed_when.
Production Insight
We adopted a policy that every task that uses ignore_errors must have a comment explaining why. This reduced accidental masking of failures by 80%.
Key Takeaway
Discipline in error handling is a force multiplier; document, test, and monitor your error handling logic.
Production incident · Post-mortem · Severity: high

The Stale Lock File Incident

Symptom
Ansible playbook failed on multiple hosts with 'lock file exists' error, and the play stopped after the first batch of 5 hosts.
Assumption
The engineer assumed the lock file would be cleaned up by a previous run, and that failure would just skip the host.
Root cause
The task that checked for the lock file had ignore_errors: yes removed during a refactor, and any_errors_fatal was set to true on the play. The lock file was stale but harmless.
Fix
Re-added ignore_errors: yes to the lock file check task, and changed any_errors_fatal to false. Also added a rescue block to delete the lock file if the deployment failed.
Key lesson
  • Never assume a failure is safe; explicitly declare error handling intent.
  • Use ignore_errors for non-critical checks, and always test with a stale state.
Production debug guide: Symptom → Root cause → Fix
Symptom · 01
Task fails but play continues, and you see 'ignored' in output
Fix
Check if ignore_errors: yes is set. If not intended, remove it. If intended, verify the task's failure condition is correct.
Symptom · 02
Task succeeds but play reports 'failed'
Fix
Check failed_when condition. It might be evaluating to true on success. Example: failed_when: result.rc != 0 but command returns 1 on success. Fix by adjusting condition.
Symptom · 03
Task reports 'changed' when nothing changed
Fix
Add changed_when: false for read-only tasks, or set a condition on the command's output, like changed_when: "'updated' in result.stdout", if the command always exits 0.
Symptom · 04
Play stops on first host failure despite serial being set
Fix
Check if any_errors_fatal: true is set on the play. Set it to false or remove it. Also check for max_fail_percentage: 0 which acts similarly.
Ansible Error Handling Quick Reference
Task fails but should be ignored
Immediate action
Add ignore_errors: yes to the task
Commands
ansible-playbook playbook.yml --check
Fix now
ignore_errors: yes
Task succeeds but should fail on specific output
Immediate action
Add failed_when with condition
Commands
ansible-playbook playbook.yml -vvv | grep 'failed_when'
ansible-playbook playbook.yml --syntax-check
Fix now
failed_when: result.rc != 0 or 'ERROR' in result.stderr
Task reports changed but shouldn't
Immediate action
Add changed_when: false
Commands
ansible-playbook playbook.yml --diff
Fix now
changed_when: false
Play stops on first failure unexpectedly
Immediate action
Check play for any_errors_fatal
Commands
grep -r 'any_errors_fatal' playbook.yml
ansible-playbook playbook.yml --list-tasks
Fix now
Set any_errors_fatal: false
Rolling update aborts too quickly
Immediate action
Check max_fail_percentage
Commands
grep 'max_fail_percentage' playbook.yml
ansible-playbook playbook.yml --check --limit batch1
Fix now
Set max_fail_percentage: 30 or remove it
Error Handling Directives Comparison
| Directive | Scope | Effect on Failure | Use Case |
| --- | --- | --- | --- |
| ignore_errors | Task | Continues play, marks task as 'failed...ignoring' | Non-critical checks, e.g., optional service stop |
| failed_when | Task | Overrides failure condition | Commands with non-standard exit codes |
| changed_when | Task | Overrides changed status | Idempotent scripts that always report success |
| block/rescue/always | Block | Try-catch-finally for task group | Multi-step operations needing rollback/cleanup |
| any_errors_fatal | Play | Stops entire play on any failure | Critical deployments where consistency is mandatory |
| max_fail_percentage | Play | Aborts if failure % exceeds threshold | Rolling updates with tolerance for transient failures |

Key takeaways

  1. Use ignore_errors sparingly; prefer failed_when for custom failure conditions.
  2. block/rescue/always is the only way to implement try-catch-finally in Ansible.
  3. any_errors_fatal stops the entire play; combine with serial to limit blast radius.
  4. max_fail_percentage is per-batch, not global; set based on your failure tolerance.
  5. Always register results from tasks that might fail for later inspection.
  6. Test error handling paths in CI by intentionally triggering failures.
  7. In rescue blocks, add ignore_errors to notification tasks to avoid secondary failures.
  8. Document every ignore_errors with a comment explaining why it's safe to ignore.

Common mistakes to avoid

Using ignore_errors to suppress all failures without understanding the cause
  • Symptom: Task shows 'failed...ignoring' but play continues; real failures go unnoticed
  • Fix: Use failed_when with specific conditions or register and handle failures explicitly

Setting any_errors_fatal: true without serial, so one failure aborts the play for every host
  • Symptom: Entire play stops on a single host's transient failure
  • Fix: Add serial: 1 or a small batch size

Not testing error handling paths in CI
  • Symptom: Error handling code has bugs that only surface in production
  • Fix: Add intentional failure tests in CI pipeline

Using changed_when: false on tasks that actually change state
  • Symptom: Handlers never fire; no change tracking
  • Fix: Use a condition based on registered variable to set changed_when accurately

Nesting block/rescue blocks without understanding failure propagation
  • Symptom: A failure inside an inner rescue propagates to the outer block and triggers its rescue unexpectedly
  • Fix: Keep blocks flat; use include_tasks for complex error handling

Forgetting to add ignore_errors to rescue tasks
  • Symptom: Rescue task fails, causing the play to fail, masking the original error
  • Fix: Add ignore_errors: yes to non-critical rescue tasks like notifications
Interview Questions on This Topic

All eight questions are pitched at senior level.

  1. What is the difference between ignore_errors and failed_when?
  2. How does block/rescue/always work in Ansible? Provide an example.
  3. What is the purpose of any_errors_fatal and max_fail_percentage?
  4. How can you handle failures in a loop (with_items) without stopping the ...
  5. What is a common pitfall when using rescue blocks for cleanup?
  6. Explain the interaction between ignore_errors and failed_when.
  7. How can you test error handling in Ansible playbooks?
  8. What is the default value of max_fail_percentage and how does it behave?

What is the difference between ignore_errors and failed_when?

ANSWER
ignore_errors tells Ansible to continue executing subsequent tasks even if the current task fails; the task is still marked as failed in output. failed_when overrides the condition that determines failure, allowing you to define custom failure criteria based on return code, stdout, etc. They can be used together, but ignore_errors takes precedence after failed_when evaluation.
Frequently Asked Questions

  1. Can I use ignore_errors and failed_when together?
  2. Does rescue block catch all types of failures?
  3. What happens if a rescue task fails?
  4. How do I skip a host on failure but continue with others?
  5. What is the difference between any_errors_fatal and max_fail_percentage: 0?
  6. Can I use changed_when with block/rescue?
  7. How do I re-raise a failure after cleanup in rescue?
  8. Is there a way to globally set error handling for all tasks?