AutoSys Fault Tolerance and Recovery — Building Resilient Batch Workflows
- n_retrys: auto-retry a failed job up to N times for transient failures
- box_terminator: stop the entire box when a critical job fails
- Dual Event Server HA: infrastructure-level failover for the AutoSys scheduler
- alarm_if_fail: notify only after all retries exhausted — don't wake ops for blips
- term_run_time: kill hung jobs so downstream isn't blocked indefinitely
- Biggest mistake: setting n_retrys too high masks permanent failures and delays escalation by hours
AutoSys Fault Tolerance – Commands & Fixes

| Symptom | Commands |
|---|---|
| Job failed — want to retry manually | `autorep -J job_name -w` to check state, then `sendevent -E FORCE_STARTJOB -J job_name` |
| Job hung — never completes | `sendevent -E KILLJOB -J job_name`, then `autorep -J job_name -q \| grep term_run_time` to see whether a runtime limit is set |
| Box not stopping on critical failure | `autorep -J job_name -q \| grep box_terminator` to check the attribute; `sendevent -E KILLJOB -J box_name` to stop the box immediately |
| Event Server down, shadow not promoting | `autoflags -a \| grep -E 'primary\|shadow'`, then `chk_auto_up -A -S SHADOW_INSTANCE` |

Production Incident
Takeaway from a real incident: put a success() condition on the first processing job after validation, so that even if box_terminator is accidentally removed, the condition still blocks execution.

Production Debug Guide
Symptom → Action quick reference for the most common production failures.

| Symptom | Action |
|---|---|
| Job fails but never retries | `autorep -J job_name -q` and verify n_retrys is set. If 0, the job will not retry automatically. Put the job on hold with `sendevent -E CHANGE_STATUS -s ON_HOLD -J job_name`, then update the JIL with n_retrys. |
| Box keeps running after a critical job fails | box_terminator defaults to 0. If needed, set box_terminator: 1 and test. Also check whether box_terminator is overridden at the box level. |
| Job hangs and never completes | `autorep -J job_name -w` to see job details. Check term_run_time — without it, the job will wait forever. Force-stop with `sendevent -E KILLJOB -J job_name`. |
| Shadow not promoting on Event Server failure | `autoflags -a \| grep shadow`. Verify network connectivity and that SHADOW_INSTANCE is configured in autosys.conf. Test failover quarterly. |
| No alarm fires on failure | Confirm alarm_if_fail: 1 is set on the job and check notification rules in WCC or custom alarm scripts. Remember: if n_retrys > 0, the alarm fires only after all retries are exhausted. |

Enterprise batch workflows run overnight when no one is watching. The jobs that matter most — payroll, settlement, reconciliation — are the ones where failures are most costly. Building fault tolerance into your AutoSys design means many failures recover automatically, and when they don't, the right people are notified with enough context to fix things quickly. It's not about eliminating failures — it's about controlling how they propagate and how fast you bounce back.
Automatic retry with n_retrys
The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.
Here's the thing: each retry is a full new attempt — the job script runs again from scratch. AutoSys doesn't resume from where it left off. So if your job is not idempotent, retries can cause data duplication or corruption. For example, an INSERT without a uniqueness check will happily create duplicate rows on each retry. Make sure your scripts handle re-entry safely: use idempotency keys, checkpoints, or database MERGE (upsert) logic.
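To make that concrete, here is a minimal sketch of a retry-safe job script. The marker path and the extractor command are hypothetical placeholders; the point is the guard-and-atomic-move pattern, not the specific tools:

```bash
#!/bin/bash
# Sketch of a retry-safe (idempotent) job script.
# Paths and the extractor command below are illustrative placeholders.
set -euo pipefail

BIZ_DATE=$(date +%Y%m%d)
MARKER="/data/markers/extract_market_${BIZ_DATE}.done"

# A previous attempt already completed: exit cleanly instead of re-loading.
if [[ -f "$MARKER" ]]; then
    echo "Extract for ${BIZ_DATE} already completed, skipping."
    exit 0
fi

# Write to a temp file, then move atomically, so a killed attempt never
# leaves a half-written output for the next retry to trip over.
TMP_OUT=$(mktemp /data/staging/extract_market.XXXXXX)
/scripts/pull_market_feed.sh > "$TMP_OUT"   # hypothetical extractor
mv "$TMP_OUT" "/data/staging/extract_market_${BIZ_DATE}.csv"

touch "$MARKER"   # record completion only after the output is in place
```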
When setting n_retrys, choose a number that matches the expected transient window. If network blips last ~30 seconds, and your job runs in 2 minutes, n_retrys: 3 gives about 6 minutes of recovery time. That's enough for most intermittent issues without delaying the pipeline too much.
```
insert_job: extract_market_data   job_type: CMD
command: /scripts/extract_market.sh
machine: data-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "18:00"
n_retrys: 3         /* retry up to 3 times after initial failure = 4 total attempts */
alarm_if_fail: 1    /* alarm only after all retries exhausted */
term_run_time: 45   /* kill if running over 45 minutes */
std_err_file: /logs/autosys/extract_market_data.err

/* Example run with two transient failures:
   18:00:01 — Attempt 1: FAILURE (exit code 1)
   18:00:31 — Retry 1:   FAILURE (exit code 1)
   18:01:01 — Retry 2:   SUCCESS (exit code 0)
   18:01:01 — extract_market_data: SUCCESS — downstream jobs proceed */
```
box_terminator — stopping the box on critical failure
In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. If one job's failure should stop everything — because its output is required or its failure invalidates all subsequent work — mark it as a box_terminator.
When a job with box_terminator: 1 fails, AutoSys immediately terminates the parent box. Inner jobs that are still pending never start (their status becomes TERMINATED). One caveat: jobs already running inside the box are killed only if they themselves set job_terminator: 1, so pair the two attributes when you need a hard stop. This prevents wasted compute on bad data and reduces the time to detect and recover.
- Data validation jobs (schema checks, referential integrity)
- Prerequisite extraction jobs (if upstream source is unavailable)
- Configuration or lookup table loads (everything depends on them)
Do not use box_terminator on jobs that have graceful degradation paths. If a downstream job can handle missing data (e.g., produce a partial report with a warning), let it run.
```
insert_job: validate_input_data   job_type: CMD
box_name: eod_box
command: /scripts/validate.sh
machine: server01
owner: batch
box_terminator: 1   /* if this fails, the entire box fails immediately */
alarm_if_fail: 1

/* Without box_terminator: other jobs in the box would continue even after validate fails */
/* With box_terminator: box immediately moves to FAILURE, all pending inner jobs skip */
```
alarm_if_fail and notification — when to wake someone up
alarm_if_fail:1 tells AutoSys to trigger an alarm when a job fails. But the timing matters: if you also have n_retrys > 0, the alarm only fires after all retries are exhausted. That's the right behaviour for transient failures — you don't want the on-call engineer paged for a 30-second network glitch.
However, some jobs should always alarm on the first failure, regardless of retries. For those, split the responsibilities: put the retry logic in a wrapper script (or a preceding job), and give the visible job alarm_if_fail: 1 with n_retrys: 0. Alternatively, use a separate notification mechanism, such as a custom script that sends a page whenever the exit code is non-zero.
In AutoSys, the alarm mechanism is typically configured in WCC or via an external event handler. The job attribute alarm_if_fail sets a flag that AutoSys propagates to the event server. Make sure your notification system (email, SMS, PagerDuty) is subscribed to these events. Many teams set up automated alerting rules that trigger on job status FAILURE, but if those rules don't respect the retry state, they may fire on every transient blip.
```
insert_job: critical_report_generation   job_type: CMD
box_name: eod_box
command: /scripts/generate_report.sh
machine: server01
owner: batch
n_retrys: 0        /* no retries — always alarm on first failure */
alarm_if_fail: 1   /* alarm immediately */

/* Alternative: a job that retries, but you want an alarm on the first failure too.
   Use a wrapper script that pages on exit code 1. */
insert_job: critical_workflow   job_type: CMD
command: /scripts/run_and_page.sh   /* wrapper does retry logic internally */
n_retrys: 0
alarm_if_fail: 1
```
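For illustration, the wrapper referenced above (run_and_page.sh is a hypothetical name, as are the work command and the webhook endpoint) might look like this. It retries internally but pages on the first failure, which n_retrys alone cannot do:

```bash
#!/bin/bash
# run_and_page.sh — sketch: retry internally, page on the FIRST failure.
# The work command and webhook URL are illustrative placeholders.
set -uo pipefail

PAGER_WEBHOOK="https://alerts.example.com/page"   # hypothetical endpoint
MAX_ATTEMPTS=3
PAGED=0

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if /scripts/do_critical_work.sh; then
        exit 0   # success: AutoSys sees exit code 0
    fi
    if [[ "$PAGED" -eq 0 ]]; then
        # Page once, on the first failure, then keep retrying quietly.
        curl -s -X POST "$PAGER_WEBHOOK" \
             -d "text=critical_workflow failed on attempt ${attempt}, retrying" || true
        PAGED=1
    fi
    sleep 30   # wait out the transient window before the next attempt
done

exit 1   # all attempts failed: AutoSys marks FAILURE and fires the alarm
```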
HA architecture for fault tolerance
At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.
The setup involves two AutoSys instances: a primary and a shadow (standby) Event Server. They share a common file system (NFS) where the AutoSys database and binaries are stored. The shadow Event Server monitors the primary via a heartbeat. If the primary becomes unreachable, the shadow promotes itself to active within a configurable timeout (default is typically 5 minutes).
Important: this is not an active-active cluster. Only one Event Processor runs jobs at a time; the shadow is a standby that must be ready to take over but does not process jobs while the primary is healthy.
Failover is automatic, but not instantaneous. During the promotion period, no jobs are scheduled and no events are processed. If the failover happens during a critical window, that gap can cause missed SLAs. Consider running failover tests inside planned maintenance windows so the gap is expected.
```bash
# Check which Event Server is currently primary
autoflags -a | grep -i 'primary\|shadow\|active'

# Verify shadow is in sync
autoflags -a | grep -i 'shadow\|standby'

# Check Event Processor status (should be RUNNING on primary)
chk_auto_up -A

# In a dual-server setup, this also shows shadow status
# chk_auto_up -A -S SHADOW_INSTANCE
```

Sample healthy output:

```
Event Server Role: PRIMARY (active)
Shadow Status: IN_SYNC
Event Processor: RUNNING
```
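If you want that check automated rather than manual, a small cron wrapper along these lines can page when the Event Processor stops reporting as running. The grep pattern and webhook are assumptions: chk_auto_up output wording varies by release, so verify the exact string in your environment first.

```bash
#!/bin/bash
# Sketch of a cron health check for the AutoSys Event Processor.
# The 'RUNNING' match and the webhook URL are assumptions; adjust to
# the actual chk_auto_up output in your release.
set -uo pipefail

STATUS=$(chk_auto_up -A 2>&1)

if ! echo "$STATUS" | grep -qi 'RUNNING'; then
    curl -s -X POST "https://alerts.example.com/page" \
         -d "text=AutoSys Event Processor check failed: ${STATUS}" || true
    exit 1
fi
```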
- Primary Event Server actively schedules and runs all jobs.
- Shadow Event Server watches the primary's heartbeat; it does not schedule jobs.
- On failure, the shadow promotes itself, reads the shared database, and starts processing.
- The shared file system must be highly available itself — if NFS goes down, failover fails.
- Failover takes time (typically 1–5 minutes) — jobs scheduled in that window are delayed.
Recovery jobs and manual intervention patterns
Even with automatic retries and box_terminators, some failures require human intervention. Recovery jobs are specially designed jobs that repair the state after a failure and allow the pipeline to resume from a clean point.
- Rollback jobs: Reverse the effects of a partially completed batch (e.g., delete inserted rows, restore files from backup).
- Re-run jobs: A job that reinitialises the pipeline after a failure — often a wrapper that truncates and re-imports data.
- Compensation jobs: Run after a failure to fix data integrity issues before the next cycle.
- Manual restart procedures: Documented steps to use sendevent to reset job statuses and re-trigger the box.
When designing recovery, think about idempotency: the recovery job should be safe to run multiple times if the first attempt also fails. Use checkpoints in your scripts: record completion steps in a control table so that rerunning the recovery job doesn't repeat already-completed actions.
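Here is a minimal sketch of that checkpoint pattern, using a marker directory in place of a control table (a database table works the same way; all step scripts are hypothetical):

```bash
#!/bin/bash
# rollback.sh — sketch of an idempotent, checkpointed recovery script.
# Each completed step leaves a marker, so a rerun resumes instead of repeating.
set -euo pipefail

CKPT_DIR="/data/recovery/checkpoints/$(date +%Y%m%d)"
mkdir -p "$CKPT_DIR"

run_step() {
    local name="$1"; shift
    if [[ -f "$CKPT_DIR/$name.done" ]]; then
        echo "Step $name already done, skipping."
        return 0
    fi
    "$@"                             # run the step; set -e aborts on failure
    touch "$CKPT_DIR/$name.done"     # mark complete only after success
}

# Hypothetical recovery steps, executed at most once each per run date:
run_step delete_partial_rows /scripts/delete_partial_rows.sh
run_step restore_files       /scripts/restore_from_backup.sh
run_step reset_job_statuses  /scripts/reset_statuses.sh
```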
A good practice is to separate recovery jobs into their own box with no dependencies on the business-critical timeline. Keep them available for ops to trigger via a JIL override or sendevent.
```
/* Recovery box: triggers after EOD box failure */
insert_job: daily_recovery   job_type: BOX
group: recovery
condition: failure(eod_box)   /* runs only if eod_box fails */
start_times: "06:00"          /* but manual trigger also works via sendevent */

/* Inside the recovery box */
insert_job: rollback_data   job_type: CMD
box_name: daily_recovery
command: /scripts/rollback.sh $CHECKPOINT $LAST_FAILED_STEP
machine: server01
owner: batch

insert_job: notify_recovery   job_type: CMD
box_name: daily_recovery
condition: success(rollback_data)
command: /scripts/send_notification.sh "Recovery complete for eod_box"
alarm_if_fail: 1
```
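And since ops should be able to fire the recovery manually at 2 AM without waiting for the condition, the standard trigger (using the box name defined above) is:

```bash
sendevent -E FORCE_STARTJOB -J daily_recovery
```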
| Fault tolerance mechanism | What it handles | Configured where |
|---|---|---|
| n_retrys | Transient job failures (network blips) | Job definition attribute |
| box_terminator | Critical failure that should stop the whole box | Job definition attribute |
| term_run_time | Hung jobs that never complete | Job definition attribute |
| alarm_if_fail + notification | Human awareness and response | Job definition attributes |
| Dual Event Server (HA) | AutoSys server/infrastructure failure | AutoSys installation config |
| Remote Agent redundancy | Agent machine failure | Machine definitions + job failover logic |
| Recovery jobs | Post-failure state repair | Dedicated JIL definitions + manual trigger |
🎯 Key Takeaways
- n_retrys handles transient failures automatically — set it on jobs prone to temporary external issues
- box_terminator: 1 stops the entire box when a critical job fails — use it on validation and pre-requisite checks
- term_run_time prevents hung jobs from blocking everything downstream indefinitely
- alarm_if_fail only fires after all retries are exhausted — adjust retry count or use custom alerting for time-sensitive jobs
- Infrastructure-level fault tolerance requires the dual Event Server HA setup — test failover regularly
- Recovery jobs must be idempotent and testable in isolation — document manual restart procedures clearly
⚠ Common Mistakes to Avoid
- Setting n_retrys too high: it masks permanent failures and delays escalation by hours.
- Retrying non-idempotent scripts, which can duplicate or corrupt data on every attempt.
- Omitting term_run_time, so a hung job blocks downstream work indefinitely.
- Alerting rules that ignore retry state and page on-call for every transient blip.
- Putting box_terminator on jobs that have graceful degradation paths.
- Never testing Event Server failover until it happens for real in production.
Interview Questions on This Topic
- Q: How does n_retrys work in AutoSys and what are its limitations? (Senior)
- Q: What is box_terminator and when would you use it? (Mid-level)
- Q: What is the difference between fault tolerance at the job level and at the infrastructure level in AutoSys? (Senior)
- Q: If a validation job fails, how do you ensure none of the downstream jobs in the box run? (Mid-level)
- Q: How do you verify that AutoSys HA is working correctly? (Senior)
- Q: How would you design a recovery plan for a critical batch box that failed at 2 AM? (Senior)
Frequently Asked Questions
How does n_retrys work in AutoSys?
n_retrys specifies how many automatic retries AutoSys performs after a job fails. With n_retrys: 3, the job runs up to 4 times total: the original attempt plus 3 retries. The alarm only fires (if alarm_if_fail: 1) after all retries are exhausted.
What is box_terminator in AutoSys?
box_terminator: 1 marks a job as the kill switch for its parent box. If this job fails, AutoSys immediately terminates the box, and pending inner jobs never run (pair it with job_terminator: 1 on inner jobs if already-running work must be killed too). It's ideal for validation or prerequisite jobs whose failure makes all downstream processing meaningless.
How do I prevent downstream jobs from running after a failure?
Use `condition: success(upstream_job)` on downstream jobs, and/or use box_terminator: 1 on the critical upstream job. With success() conditions, downstream jobs only start when the upstream succeeds. With box_terminator, the entire box stops on failure.
How do I test AutoSys HA failover?
In a test environment, stop the primary Event Server and verify the shadow promotes automatically within the expected time. Check with autoflags -a that the shadow is now the primary, and verify that jobs continue to be scheduled correctly. Document the failover procedure and re-test it regularly (quarterly is a common cadence) in production-equivalent environments.
Should I set n_retrys on every job?
Not necessarily. n_retrys is best for jobs that interface with external systems prone to transient failures (network services, external APIs, databases under load). For jobs with deterministic inputs and outputs, a single failure usually warrants human investigation rather than automatic retry.
What is a recovery job and when should I use one?
A recovery job is a dedicated job (or box) that performs state repair after a failure — rolling back partial changes, truncating tables, restoring files. Use it when automatic retries are insufficient and manual repair would be error-prone. Always make recovery jobs idempotent so they can be safely re-run.
How do I handle a job that is stuck in STARTING status?
A job stuck in STARTING usually means the Remote Agent is unreachable or the Event Processor cannot start the process. Check the agent with autoping -m machine_name, or pull a machine report with autorep -M machine_name. If the agent is down, restart it. If the job is orphaned, kill it with sendevent -E KILLJOB -J job_name and then reschedule.