AutoSys Job Failure Handling and Restart Procedures
Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.
Step 1: Triage — identify the failure
When paged, your first job is to understand the scope before touching anything.
```sh
# 1a. How many jobs are failing?
autorep -J % | grep -c ' FA '

# 1b. Are the failures concentrated on one machine?
autorep -J % | grep ' FA ' | awk '{print $1}' | while read -r j; do
    autorep -J "$j" -q | grep machine:
done

# 1c. Is the agent machine available?
autorep -M %            # check all machines — any MISSING?

# 1d. Are we looking at PEND_MACH rather than FAILURE?
autorep -J % | grep -c ' PD '    # PD = PEND_MACH
```
Step 2: Diagnose — read the error log
Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why.
```sh
# 2a. Get the error log path (from the job's JIL definition)
autorep -J failed_job -q | grep -E 'std_err_file|std_out_file'

# 2b. Read the error log
tail -100 /logs/autosys/failed_job.err

# 2c. Compare the previous run with the detailed current state
autorep -J failed_job -r -1    # previous run
autorep -J failed_job -d       # detailed current state

# 2d. Check the current status
autostatus -J failed_job
```
Contents of failed_job.err:
```
[2026-03-19 22:02:15] ERROR: Cannot connect to database DB_PROD
[2026-03-19 22:02:15] Connection refused: prod-db-01:5432 (timeout after 30s)
[2026-03-19 22:02:15] Script exiting with code 1
```
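The log points at a refused database connection, so the fix lives outside AutoSys. Before restarting, it is worth confirming the database is reachable again. A minimal sketch using bash's `/dev/tcp` redirection; the host and port are taken from the error log above, so adjust them for your environment:

```sh
# Probe the DB port from the error log before restarting the job.
# DB_HOST/DB_PORT come from the log above; change them for your environment.
DB_HOST=prod-db-01
DB_PORT=5432

if timeout 5 bash -c "exec 3<>/dev/tcp/${DB_HOST}/${DB_PORT}" 2>/dev/null; then
    echo "${DB_HOST}:${DB_PORT} is reachable, safe to restart"
else
    echo "${DB_HOST}:${DB_PORT} still refusing connections, fix the DB first"
fi
```

If `nc` is installed, `nc -z "$DB_HOST" "$DB_PORT"` performs the same check without bash-specific redirection.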
Step 3: Fix and restart
After fixing the root cause, restart the job correctly.
```sh
# Option A: STARTJOB — clean retry that still honors the job's start conditions
sendevent -E STARTJOB -J failed_job

# Option B: FORCE_STARTJOB — bypasses start conditions; use when needed
sendevent -E FORCE_STARTJOB -J failed_job

# Option C: a BOX failed — restart the box and all its inner jobs.
# First, reset all inner FAILURE jobs:
sendevent -E CHANGE_STATUS -s INACTIVE -J inner_job1
sendevent -E CHANGE_STATUS -s INACTIVE -J inner_job2
# Then restart the box:
sendevent -E FORCE_STARTJOB -J eod_box

# Monitor the restart
watch -n 10 'autorep -J failed_job'
```
```
03:41:02 — failed_job: RUNNING
03:49:33 — failed_job: SUCCESS
03:49:34 — downstream_job: STARTING
```
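Rather than sitting on `watch`, a small polling helper can announce the outcome. This is a sketch, not part of AutoSys: `wait_for_job` is a hypothetical wrapper that assumes the `autostatus` CLI is on PATH.

```sh
# Poll a restarted job until it reaches a terminal state, a hands-off
# alternative to `watch`. Assumes the AutoSys CLI (autostatus) is available.
wait_for_job() {
    job=$1
    deadline=$(( $(date +%s) + ${2:-3600} ))    # default: give up after 1 hour
    while [ "$(date +%s)" -lt "$deadline" ]; do
        status=$(autostatus -J "$job")
        case "$status" in
            SUCCESS)            echo "$job recovered"; return 0 ;;
            FAILURE|TERMINATED) echo "$job failed again"; return 1 ;;
            *)                  sleep 30 ;;      # still RUNNING/STARTING etc.
        esac
    done
    echo "timed out waiting for $job"; return 2
}

# Usage (job name illustrative):
# wait_for_job failed_job 3600
```

For long-running jobs, widen the poll interval so you are not hammering the application server every few seconds.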
Step 4: Verify recovery
After restart succeeds, verify the downstream chain completed and the actual data/output is valid.
```sh
# Check all downstream jobs in the chain
autorep -J eod_%

# Confirm the box moved to SUCCESS
autorep -J eod_processing_box

# Check the actual output — did the script produce the expected file?
ls -la /data/output/daily_report_$(date +%Y%m%d).csv
wc -l /data/output/daily_report_$(date +%Y%m%d).csv    # check row count
```
```
eod_extract           SU   0
eod_transform         SU   0
eod_load              SU   0
eod_report            SU   0
eod_processing_box    SU   --

$ wc -l /data/output/daily_report_20260319.csv
54823 /data/output/daily_report_20260319.csv
```
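Checks like the one above are easy to script. A hedged sketch of an output validation helper; `verify_output` is a hypothetical function, and the path and minimum row count in the usage line are illustrative placeholders:

```sh
# Verify the job's real output, not just its AutoSys status:
# the file must exist, be non-empty, and have a plausible row count.
verify_output() {
    file=$1
    min_rows=$2
    if [ ! -s "$file" ]; then
        echo "MISSING or empty: $file"
        return 1
    fi
    rows=$(wc -l < "$file" | tr -d ' ')
    if [ "$rows" -lt "$min_rows" ]; then
        echo "SUSPICIOUS: $file has only $rows rows (expected >= $min_rows)"
        return 1
    fi
    echo "OK: $file ($rows rows)"
}

# Usage (path and threshold illustrative):
# verify_output /data/output/daily_report_$(date +%Y%m%d).csv 50000
```

A row-count floor catches the common failure mode where a rerun "succeeds" but produces a truncated file.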
| Scenario | Recovery action | Key consideration |
|---|---|---|
| Single job FAILURE | Fix cause → STARTJOB or FORCE_STARTJOB | Read error log first |
| BOX with one inner job failed | Fix cause → RESTART inner job | Check if other inner jobs are now blocked |
| BOX in FAILURE with all inner jobs failed | Fix cause → FORCE_STARTJOB the box | May need to reset inner jobs to INACTIVE first |
| PEND_MACH | Fix the agent machine → jobs auto-recover | Don't FORCE_STARTJOB while agent is down |
| TERMINATED by term_run_time | Increase term_run_time → STARTJOB | Investigate why the job ran longer than expected |
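The "all inner jobs failed" row of the table can be wrapped in a small helper. This is a sketch under the assumption that you have already confirmed with autorep which inner jobs failed; `recover_box` is a hypothetical function and the job names in the usage line are illustrative:

```sh
# Reset each failed inner job to INACTIVE, then force-start the box.
# Pass the box name first, then the failed inner jobs.
recover_box() {
    box=$1
    shift
    for inner in "$@"; do
        sendevent -E CHANGE_STATUS -s INACTIVE -J "$inner"
    done
    sendevent -E FORCE_STARTJOB -J "$box"
}

# Usage (names illustrative):
# recover_box eod_box inner_job1 inner_job2
```

Keeping the reset and the force-start in one function avoids the classic mistake of force-starting the box while an inner job is still sitting in FAILURE.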
🎯 Key Takeaways
- Always read the error log before restarting — restart into a still-broken environment wastes time
- For single job failures: fix cause → STARTJOB; for box failures: may need to reset inner jobs to INACTIVE first
- After recovery, verify actual output — SUCCESS status means exit code 0, not that the data is correct
- Document what went wrong and the fix in the incident log — recurring failures need root cause analysis
⚠ Common Mistakes to Avoid
- ✕ Restarting without reading the error log — results in the same failure with additional delay
- ✕ Restarting a job inside a box without considering the box's state — if the box is in FAILURE, inner job restarts may not propagate correctly
- ✕ Not verifying the actual output after recovery — AutoSys reporting SUCCESS doesn't mean your data is correct
- ✕ Restarting the box when only one inner job failed — often better to restart only the failed inner job rather than the entire box
Interview Questions on This Topic
- Q: Walk me through how you would handle a failed AutoSys job in production.
- Q: What do you check before restarting a failed AutoSys job?
- Q: If a BOX is in FAILURE because one inner job failed, how do you recover?
- Q: What is the difference between restarting a job vs restarting its parent box?
- Q: After restarting a failed job, how do you verify the recovery was successful?
Frequently Asked Questions
What should I do when an AutoSys job fails?
First, read the error log (std_err_file) to understand why it failed. Fix the underlying issue. Then restart with sendevent -E STARTJOB (or FORCE_STARTJOB to bypass start conditions). Monitor until SUCCESS. Finally, verify the actual output is correct — not just that AutoSys reports success.
How do I restart an AutoSys BOX that failed?
For a BOX failure, you usually need to reset failed inner jobs to INACTIVE first (CHANGE_STATUS -s INACTIVE), then FORCE_STARTJOB the box. Alternatively, restart only the failed inner jobs while leaving succeeded ones in SUCCESS — the box will recheck and proceed.
My AutoSys job keeps failing at the same step — what should I do?
A recurring failure at the same step usually indicates a root cause in the script, its configuration, or an external dependency. Check the error log in detail, add more logging to the script, and engage the application team. Restarting repeatedly without fixing the root cause is a treadmill.
After restarting, how long should I monitor before declaring recovery complete?
Monitor until the full downstream job chain completes successfully — not just the restarted job. Use autorep -J box_name% to check all inner jobs. For critical workflows, also verify the actual output data.
How do I stop downstream jobs from running after a manual fix?
If you've fixed data manually outside AutoSys and don't want downstream jobs running again, either use CHANGE_STATUS to mark jobs SUCCESS (to signal done) or put downstream jobs ON_HOLD before the upstream restart, then release them when appropriate.
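The hold-then-release pattern from this answer, sketched as a helper. `hold_restart_release` is a hypothetical function, the job names are illustrative, and in practice you would verify the upstream output before issuing the final release:

```sh
# Park the downstream job, restart the upstream job, then release the hold.
# Uses the standard JOB_ON_HOLD / JOB_OFF_HOLD sendevent events.
hold_restart_release() {
    upstream=$1
    downstream=$2
    sendevent -E JOB_ON_HOLD  -J "$downstream"
    sendevent -E STARTJOB     -J "$upstream"     # or FORCE_STARTJOB
    # ...verify the upstream output here before releasing...
    sendevent -E JOB_OFF_HOLD -J "$downstream"
}

# Usage (names illustrative):
# hold_restart_release upstream_job downstream_job
```

ON_HOLD is preferable to ON_ICE here: a held job still blocks its own downstream dependents, so nothing sneaks past while you verify.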