AutoSys Fault Tolerance: 5 Recovery Patterns That Work
- n_retrys: AutoSys retries failed jobs N times. Handles network blips. Default alarm fires only after all retries exhaust.
- box_terminator: Stops the entire box when a critical job fails. Use on validation jobs — bad input shouldn't propagate.
- term_run_time: Hard kill after N minutes. Prevents hung jobs from blocking downstream workflows forever.
- Dual Event Server HA: Automatic failover takes 60-90 seconds. Running jobs continue; new jobs wait.
- The 3 AM lesson: Retries without root-cause fixes mask problems. Permanent failures still need humans.
Fault Tolerance — 60-Second Diagnosis
Job retrying too many times
    autorep -J JOBNAME -q | grep n_retrys
    autorep -J JOBNAME -L 20 | grep FAILURE
Box failed but no clear reason
    autorep -J BOXNAME -d | grep 'FAILURE\|TERMINATED'
    autorep -J BOXNAME -q | grep box_terminator
Job hung, not terminating
    autorep -J JOBNAME -q | grep term_run_time
    date; autorep -J JOBNAME -q | grep 'start time'
HA not failing over
    autoflags -a | grep -E 'Primary|Shadow|Active'
    tail -50 $AUTOUSER/out/event_demon.* | grep -i failover
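The grep checks above can be folded into a small audit script. The following is a minimal sketch, assuming the job definition is available as JIL text with one attribute per line (real `autorep -q` output may place several attributes on one line, which this parser does not handle). The helper names `parse_jil` and `missing_safeguards` are hypothetical, not part of AutoSys.

```python
import re


def parse_jil(text):
    """Parse 'key: value' JIL lines into a dict, dropping /* ... */ comments.

    Assumes one attribute per line; this is a simplification.
    """
    text = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    attrs = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def missing_safeguards(attrs, required=("n_retrys", "term_run_time")):
    """Return the fault-tolerance attributes that the job does not set."""
    return [name for name in required if name not in attrs]


# Hypothetical definition used only for illustration.
SAMPLE = """
insert_job: extract_market_data
job_type: CMD
command: /scripts/extract_market.sh
n_retrys: 3  /* retries after the first failure */
alarm_if_fail: 1
"""

job = parse_jil(SAMPLE)
print(missing_safeguards(job))
```

For the sample definition this reports `term_run_time` as missing, which is the same answer the `grep term_run_time` check would give by returning nothing.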
Enterprise batch workflows run overnight when no one's watching. The jobs that matter most — payroll, settlement, reconciliation — are the ones where failures cost the most.
Here's the problem most teams learn the hard way: retries mask flaky scripts until they don't. Box terminators stop bad data from propagating, but only if you put them in the right place. And HA failover? 60-90 seconds feels fast until it's your 2 AM SLA.
This isn't theory. These are the patterns that actually keep workflows alive when things break.
Automatic retry with n_retrys
The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.
A detail that is easy to miss: n_retrys counts retries after the initial attempt, so n_retrys: 3 means up to 4 total runs. The wait between attempts is set in the scheduler configuration rather than in the job definition, so confirm the interval on your instance instead of assuming a fixed value.
Warning: alarm_if_fail fires only after all retries are exhausted. If your job succeeds on retry 3, no alarm ever fires. That is fine for genuinely transient failures, but it also hides a job that fails on its first attempt every single night.
insert_job: extract_market_data
job_type: CMD
command: /scripts/extract_market.sh
machine: data-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "18:00"
n_retrys: 3 /* retry up to 3 times after initial failure = 4 total attempts */
alarm_if_fail: 1 /* alarm only after all retries exhausted */
term_run_time: 45 /* kill if running over 45 minutes */
std_err_file: /logs/autosys/extract_market_data.err

/* To alert on first failure regardless of retries, use separate monitoring */
/* Add a dummy dependency job that detects failure status via autorep */
Example run:
18:00:01 - Attempt 1: FAILURE (exit code 1)
18:00:31 - Retry 1: FAILURE (exit code 1)
18:01:01 - Retry 2: SUCCESS (exit code 0)
18:01:01 - extract_market_data: SUCCESS, downstream jobs proceed
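The counting rule and the alarm behaviour are easy to get wrong, so here is a minimal sketch that models them as this section describes them. It is a simulation only and does not talk to AutoSys. The function name `run_with_retries` and its inputs are hypothetical.

```python
def run_with_retries(exit_codes, n_retrys, alarm_if_fail=True):
    """Simulate one scheduled run of a job.

    exit_codes: the exit code each successive attempt would return.
    n_retrys:   retries allowed AFTER the initial attempt.
    Returns (final_status, attempts_used, alarm_raised).
    """
    max_attempts = 1 + n_retrys  # initial attempt plus retries
    attempts = 0
    status = "FAILURE"
    for code in exit_codes[:max_attempts]:
        attempts += 1
        if code == 0:
            status = "SUCCESS"
            break
    # The alarm is only considered once the final status is known.
    alarm = alarm_if_fail and status == "FAILURE"
    return status, attempts, alarm


print(run_with_retries([1, 1, 0], n_retrys=3))        # ('SUCCESS', 3, False)
print(run_with_retries([1, 1, 1, 1, 1], n_retrys=3))  # ('FAILURE', 4, True)
```

The first call mirrors the example run: two failures, then a success, and no alarm. The second shows that n_retrys: 3 stops after four attempts even if a fifth would have been possible.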
box_terminator — stopping the box on critical failure
In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. That's usually what you want — a reporting job failing shouldn't stop the data load.
But sometimes one job's failure should stop everything. If your validation step says 'input data is corrupt', there's zero point running the 50 downstream jobs. They'll just produce garbage.
box_terminator: 1 marks the kill switch. When that job fails, AutoSys immediately terminates the entire box. All pending inner jobs skip to TERMINATED state. The box status becomes FAILURE immediately — no waiting for other jobs to finish.
/* Box definition (define the box before the jobs that reference it) */
insert_job: eod_box
job_type: BOX
owner: batch
date_conditions: 1
days_of_week: all
start_times: "23:00"

insert_job: validate_input_data
job_type: CMD
box_name: eod_box
command: /scripts/validate.sh
machine: server01
owner: batch
box_terminator: 1 /* if this fails, the entire box fails immediately */
alarm_if_fail: 1

/* Without box_terminator: other jobs in the box would continue even after validate fails */
/* With box_terminator: box immediately moves to FAILURE, all pending inner jobs skip */
- Normal failure = other jobs continue. box_terminator failure = whole box stops.
- Only one job per box should be box_terminator — usually the first validation job.
- Box stays in FAILURE until manually restarted or conditionally cleared.
- Downstream jobs move to TERMINATED, not FAILURE. They never attempt to run.
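To make the difference between a normal failure and a box_terminator failure concrete, here is a minimal sketch that models the behaviour as this section describes it. It is a deliberately simplified, sequential simulation and does not reproduce the real scheduler. The function `run_box` and the job names are hypothetical.

```python
def run_box(jobs):
    """Simulate a box.

    jobs: list of (name, succeeds, box_terminator) tuples in run order.
    Returns (box_status, {job_name: status}).
    """
    statuses = {}
    box_status = "SUCCESS"
    box_killed = False
    for name, succeeds, box_terminator in jobs:
        if box_killed:
            # Pending jobs never attempt to run once the box is terminated.
            statuses[name] = "TERMINATED"
            continue
        if succeeds:
            statuses[name] = "SUCCESS"
            continue
        statuses[name] = "FAILURE"
        box_status = "FAILURE"
        if box_terminator:
            box_killed = True
    return box_status, statuses


# Validation fails and is the box terminator: nothing else runs.
print(run_box([("validate", False, True),
               ("load", True, False),
               ("report", True, False)]))

# An ordinary job fails: the remaining jobs still run.
print(run_box([("validate", True, True),
               ("report", False, False),
               ("load", True, False)]))
```

In the first case `load` and `report` end up TERMINATED. In the second, `load` still succeeds even though `report` failed, which is the default behaviour the section contrasts against.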
term_run_time — preventing the infinite hang
A job that runs forever is worse than a job that fails. Failing at least triggers alerts and retries. Hanging just blocks everything downstream indefinitely.
term_run_time kills a job after N minutes from its start time. The count begins when the job starts (including retries — each retry resets the timer). When term_run_time expires, AutoSys sends a SIGTERM to the agent. The agent terminates the job process and updates status to TERMINATED.
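Crucial difference: TERMINATED is not FAILURE. A condition like success(job) won't fire on TERMINATED. If you want downstream jobs to run after a timeout, give them a condition that accepts it, such as success(job) OR terminated(job), or use done(job), which is satisfied by SUCCESS, FAILURE, or TERMINATED. Another option is a wrapper script that enforces its own timeout and returns a meaningful exit code.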
insert_job: nightly_reconcile
job_type: CMD
command: /scripts/reconcile.sh
machine: finance-server
owner: batch
date_conditions: 1
days_of_week: all
start_times: "23:00"
term_run_time: 390 /* 6.5 hours: kill if still running at 5:30 AM */
run_window: "23:00-05:30" /* only gates when the job may start; term_run_time does the kill */
alarm_if_fail: 1

/* On a downstream job that should run even if this one times out: */
/* condition: success(nightly_reconcile) OR terminated(nightly_reconcile) */

/* Or better: point command at a wrapper script that handles the timeout itself */
/* command: /scripts/reconcile_with_timeout.sh */
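term_run_time is expressed in minutes from job start, while deadlines are usually stated as clock times. The following sketch shows the arithmetic for a window that crosses midnight. The helper name `term_run_time_minutes` is hypothetical, and it assumes the job starts exactly at its scheduled time.

```python
from datetime import datetime, timedelta


def term_run_time_minutes(start_hhmm, deadline_hhmm):
    """Minutes between a start time and a deadline, both as 'HH:MM'.

    A deadline at or before the start is taken to be on the next day.
    """
    start = datetime.strptime(start_hhmm, "%H:%M")
    deadline = datetime.strptime(deadline_hhmm, "%H:%M")
    if deadline <= start:
        deadline += timedelta(days=1)
    return int((deadline - start).total_seconds() // 60)


print(term_run_time_minutes("23:00", "05:30"))  # 390
```

Because each retry resets the timer, a job that fails late and retries can still run past the wall-clock deadline, so a value computed this way is an upper bound per attempt rather than a guarantee for the whole run.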
HA architecture for fault tolerance
At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.
How it works: Primary Event Server handles all writes. Shadow Event Server maintains a real-time replica via database replication (Oracle Data Guard, Sybase Replication, etc.). The Event Processor monitors the primary through heartbeat checks (default 60 seconds).
When the heartbeat fails, the Event Processor promotes the shadow to primary. Total downtime: 60-90 seconds. During this window, running jobs continue unaffected. However, no new jobs start. The Event Processor queues events during failover and processes them once the new primary is online.
Critical nuance: Replication lag is your enemy. If the shadow is 5 minutes behind when the primary fails, you lose 5 minutes of events. Those job completions, status changes, and sendevent calls are gone.
# Check which Event Server is currently primary
autoflags -a | grep -i 'primary\|shadow\|active'

# Verify shadow is in sync
autoflags -a | grep -i 'shadow\|standby'

# Check Event Processor status (should be RUNNING on primary)
chk_auto_up -A

# Check replication lag (Oracle example); check_lag.sql contains:
#   SELECT NAME, VALUE FROM V$DATAGUARD_STATS WHERE NAME = 'apply lag';
sqlplus autosys_user @check_lag.sql

# Manual failover (test only)
sendevent -E SWITCH_TO_SHADOW
Event Server Role: PRIMARY (active)
Shadow Status: IN_SYNC
Replication Lag: 0 seconds
Event Processor: RUNNING
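The cost of replication lag can be estimated before an incident rather than after. This is a minimal sketch, assuming you have a list of timestamped events and a measured lag. The function `events_at_risk` and the sample events are hypothetical and only illustrate the reasoning above.

```python
from datetime import datetime, timedelta


def events_at_risk(events, failure_time, lag):
    """Return names of events written to the primary but not yet replicated.

    events: list of (timestamp, name) tuples.
    lag:    how far the shadow was behind when the primary failed.
    """
    cutoff = failure_time - lag  # last moment the shadow had caught up to
    return [name for ts, name in events if cutoff < ts <= failure_time]


failure = datetime(2024, 1, 1, 2, 0)
sample_events = [
    (datetime(2024, 1, 1, 1, 50), "load_trades SUCCESS"),
    (datetime(2024, 1, 1, 1, 56), "settle_batch SUCCESS"),
    (datetime(2024, 1, 1, 1, 59), "sendevent FORCE_STARTJOB recon"),
]

print(events_at_risk(sample_events, failure, timedelta(minutes=5)))
```

With a five-minute lag, the two events after 01:55 are the ones that would be missing on the promoted shadow. With zero lag the list is empty.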
| Fault tolerance mechanism | What it handles | What it doesn't handle | Configured where |
|---|---|---|---|
| n_retrys | Transient job failures (network blips, timeouts) | Permanent failures, logic bugs | Job definition attribute |
| box_terminator | Critical failure that should stop the whole box | Multiple failures — only one job can be terminator | Job definition attribute |
| term_run_time | Hung jobs that never complete | Slow-but-alive jobs (needs buffer) | Job definition attribute |
| Dual Event Server (HA) | AutoSys server/infrastructure failure | Agent failure, network partition between sites | AutoSys installation config |
| alarm_if_fail + notification | Human awareness and response | Automatic recovery — still needs humans | Job definition + external paging |
🎯 Key Takeaways
- n_retrys handles transient failures — but monitor retry rate. >1% retry means fix the root cause.
- box_terminator: 1 on validation jobs stops bad data propagation immediately.
- term_run_time prevents infinite hangs. Every external-facing job needs it.
- HA failover takes 60-90 seconds and needs quarterly testing. Untested HA is a trap.
- Retries mask problems. Alert on first failure AND on final failure, and know the difference.
- Terminated ≠ Failed. Downstream jobs need explicit OR conditions to handle timeouts.
Interview Questions on This Topic
- How does n_retrys work in AutoSys? (Junior)
- What is box_terminator and when would you use it? (Mid-level)
- What is the difference between fault tolerance at the job level and at the infrastructure level in AutoSys? (Senior)
- If a validation job fails, how do you ensure none of the downstream jobs in the box run? (Mid-level)
- How do you verify that AutoSys HA is working correctly? (Senior)
Frequently Asked Questions
How does n_retrys work in AutoSys?
n_retrys specifies how many automatic retries AutoSys performs after a job fails. With n_retrys: 3, the job runs up to 4 times total: the original attempt plus 3 retries. The alarm only fires (if alarm_if_fail: 1) after all retries are exhausted.
Retry interval: set in the scheduler configuration rather than in the job definition, so check the value on your instance. If a job succeeds on any retry, AutoSys treats it as SUCCESS and does not raise an alarm.
Important: n_retrys counts retries AFTER the first failure. n_retrys: 0 means no retries (fail once, alarm immediately).
What is box_terminator in AutoSys?
box_terminator: 1 marks a job as the kill switch for its parent box. If this job fails, AutoSys immediately terminates the box and all remaining pending inner jobs. The box status becomes FAILURE.
Use cases: Data validation jobs, prerequisite checks, file existence verification — any job whose failure makes downstream processing meaningless.
Anti-pattern: Do NOT mark optional jobs or cleanup jobs as box_terminator. Their failure should not stop the entire box.
How do I prevent downstream jobs from running after a failure?
Two approaches:
- box_terminator: 1 on the critical upstream job. When it fails, the entire box terminates. All pending downstream jobs skip to TERMINATED.
- condition: success(upstream_job) on each downstream job. Downstream jobs only start when the upstream succeeds — but this requires maintaining conditions on potentially many jobs.
Best practice: Use box_terminator on the first critical validation job. Use success() conditions on jobs that are downstream of the validation but within the same box for additional safety.
How do I test AutoSys HA failover?
In a staging environment:
- Verify replication is in sync: autoflags -a (look for 'SHADOW STATUS: IN_SYNC')
- Stop the primary Event Server (gracefully if possible): sendevent -E STOP_DEMON on the primary host, or stop the database listener.
- Monitor autoflags -a on the shadow — it should promote within 60-90 seconds.
- Verify Event Processor is running on the new primary: chk_auto_up -A
- Test job execution: submit a test job and verify it runs correctly.
- Restore the original primary and re-establish replication (procedure varies by DB).
Do this quarterly. Document every step. If you can't test automatically, test manually — but test.
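The "should promote within 60-90 seconds" step is the part of this test most worth automating. Below is a minimal sketch of a poll-with-timeout helper. The `check` callable is a placeholder for whatever command you use to detect that the shadow has been promoted; nothing here calls AutoSys. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time


def wait_for(check, timeout_s=90, interval_s=5,
             sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if the condition was met in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Stand-in clock so the example runs instantly."""

    def __init__(self):
        self.now = 0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Hypothetical scenario: promotion is observed on the third poll.
fake = FakeClock()
polls = iter([False, False, True])
promoted = wait_for(lambda: next(polls), sleep=fake.sleep, clock=fake)
print(promoted, fake.now)
```

In a real staging test you would drop the fake clock and pass a `check` that runs your status command and inspects its output. A False result is the signal that failover did not complete inside the window.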
Should I set n_retrys on every job?
Not necessarily. n_retrys is best for jobs that interface with external systems prone to transient failures (network services, external APIs, databases under load, cloud storage).
For jobs with deterministic inputs and outputs (data transformations, calculations, local file processing), a single failure usually warrants human investigation rather than automatic retry. The failure is likely a logic bug or missing data — retrying won't help.
General rule: Set n_retrys: 2 for external-facing jobs. Set n_retrys: 0 for purely computational jobs. Monitor retry rates regardless — if any job needs retries >1% of runs, investigate the root cause.
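The 1% rule only helps if retry rate is actually measured. Here is a minimal sketch, assuming you can export the number of attempts each scheduled run needed (for example from run history). The function names and the sample data are hypothetical.

```python
def retry_rate(attempts_per_run):
    """Fraction of runs that needed at least one retry.

    attempts_per_run: list of attempt counts, one entry per scheduled run.
    """
    if not attempts_per_run:
        return 0.0
    retried = sum(1 for attempts in attempts_per_run if attempts > 1)
    return retried / len(attempts_per_run)


def needs_investigation(attempts_per_run, threshold=0.01):
    """True when the retry rate exceeds the threshold (1% by default)."""
    return retry_rate(attempts_per_run) > threshold


# 200 runs, 3 of which needed a retry: 1.5%.
flaky = [1] * 197 + [2, 2, 3]
# 200 runs, 1 of which needed a retry: 0.5%.
healthy = [1] * 199 + [2]

print(needs_investigation(flaky), needs_investigation(healthy))
```

The point of counting runs rather than attempts is that a job which succeeds on its second try every night has a 100% retry rate, even though it never raises an alarm.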
What's the difference between TERMINATED and FAILURE?
FAILURE: Job ran to completion but returned a non-zero exit code. AutoSys treats this as 'job tried and failed'.
TERMINATED: Job was forcibly killed without completing by term_run_time, FORCE_TERMINATE_JOB, or system signal. AutoSys treats this as 'job stopped without finishing'.
Key difference for dependencies: success(job) is satisfied only by SUCCESS status. It is not satisfied by TERMINATED or FAILURE. If you want downstream jobs to run after a timeout (TERMINATED), use condition: success(upstream) OR terminated(upstream).
TERMINATED is often worse than FAILURE because it's silent — no alarm fires by default unless you explicitly monitor for it.
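The dependency rule can be written out as a small truth table. This is a minimal sketch that models a success-only condition against one that also accepts a timeout. It evaluates plain status strings and is not a JIL parser; the function name `downstream_may_start` is hypothetical.

```python
def downstream_may_start(upstream_status, accept_terminated=False):
    """Model a downstream job's start condition.

    accept_terminated=False models a success-only condition.
    accept_terminated=True models success OR terminated.
    """
    if upstream_status == "SUCCESS":
        return True
    return accept_terminated and upstream_status == "TERMINATED"


for status in ("SUCCESS", "FAILURE", "TERMINATED"):
    print(status,
          downstream_may_start(status),
          downstream_may_start(status, accept_terminated=True))
```

Only the TERMINATED row differs between the two columns, and FAILURE blocks the downstream job in both. That is the behaviour to confirm against your own job definitions.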
Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.