AutoSys Job Failure Handling and Restart Procedures

Step-by-step AutoSys failure handling: how to investigate a failed job, read error logs, restart correctly, handle box failures, and run post-recovery verification.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn:
  • Always read the error log before restarting: restarting into a still-broken environment wastes time
  • For single job failures: fix cause → RESTART; for box failures: may need to reset inner jobs to INACTIVE first
  • After recovery, verify actual output — SUCCESS status means exit code 0, not that the data is correct
⚡ Quick Answer
When an AutoSys job fails at 3 AM, you need a clear playbook: find out why it failed, fix the issue, restart correctly, and verify recovery. This article is that playbook.

Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.

Step 1: Triage — identify the failure

When paged, your first job is to understand the scope before touching anything.

triage.sh · BASH
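# 1a. How many jobs are failing?
# (autorep's -s flag selects the summary report; it does not filter by status,
# so match the FA code in the summary output instead)
autorep -J % -s | grep -cw 'FA'

# 1b. Are jobs failing on a specific machine?
# (the machine attribute is part of the job definition, shown by -q)
autorep -J % -s | grep -w 'FA' | awk '{print $1}' | while read -r j; do
  autorep -J "$j" -q | grep 'machine:'
done

# 1c. Is the agent machine available?
autorep -M %  # check all machines: any MISSING?

# 1d. Are we looking at PEND_MACH instead of FAILURE?
autorep -J % -s | grep -cw 'PE'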
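
The triage checks above can also be done in one pass over the summary report. The following is a minimal sketch, assuming the default `autorep` summary layout in which each job row carries a two-letter status code (FA, TE, PE) as its own whitespace-separated field. `count_by_status` is a hypothetical helper name, and the field matching is a heuristic, not a documented parser.

```shell
#!/usr/bin/env bash
# count_by_status: read autorep summary text on stdin and print
# "<status> <count>" for each problem status that appears.
# Assumption: the status appears as a standalone two-letter field
# somewhere after the job name (field 1 is skipped on purpose).
count_by_status() {
  awk '
    {
      for (i = 2; i <= NF; i++) {
        if ($i == "FA" || $i == "TE" || $i == "PE") {
          count[$i]++
          break
        }
      }
    }
    END {
      # Print in a fixed order so the output is stable
      n = split("FA TE PE", order, " ")
      for (k = 1; k <= n; k++) {
        s = order[k]
        if (s in count) print s, count[s]
      }
    }
  '
}

# Usage on a live system (not run here):
#   autorep -J % | count_by_status
```

A large PE count with few FA rows usually points at an agent machine problem rather than at the jobs themselves, which changes what you fix first.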

Step 2: Diagnose — read the error log

Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why.

diagnose.sh · BASH
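# 2a. Get the error log path (std_err_file and std_out_file are job definition
# attributes, so query the definition with -q)
autorep -J failed_job -q | grep -E 'std_err_file|std_out_file'

# 2b. Read the error log
tail -n 100 /logs/autosys/failed_job.err

# 2c. Read the AutoSys event history for this job
autorep -J failed_job -d         # detailed report for the latest run
autorep -J failed_job -d -r -1   # detailed report for the previous run

# 2d. Check the current status (the exit code is in the Pri/Xit column of the summary)
autostatus -J failed_job
autorep -J failed_job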
▶ Output
std_err_file: /logs/autosys/failed_job.err

Contents of failed_job.err:
[2026-03-19 22:02:15] ERROR: Cannot connect to database DB_PROD
[2026-03-19 22:02:15] Connection refused: prod-db-01:5432 (timeout after 30s)
[2026-03-19 22:02:15] Script exiting with code 1
⚠️ Always read the error log BEFORE restarting
If the database is down, restarting the job gets you nothing. Fix the underlying issue first. Reading the error log takes 30 seconds; diagnosing why a restart failed again takes much longer.
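
To make the "read the log first" habit quicker at 3 AM, a first-pass classifier can flag the obvious cases. This is a rough sketch: `classify_error` is a hypothetical helper, and the patterns are illustrative examples rather than a complete list of what your scripts may log.

```shell
#!/usr/bin/env bash
# classify_error: print a coarse category for the last 100 lines of an error log.
# The patterns are examples only; extend them to match your own scripts.
classify_error() {
  local log_file=$1
  if [ ! -r "$log_file" ]; then
    echo "unreadable"
    return 2
  fi
  local recent
  recent=$(tail -n 100 "$log_file")
  if printf '%s\n' "$recent" | grep -qiE 'connection refused|cannot connect|timed? ?out'; then
    echo "dependency_down"   # fix the external service before restarting
  elif printf '%s\n' "$recent" | grep -qiE 'no space left|disk quota'; then
    echo "disk_full"
  elif printf '%s\n' "$recent" | grep -qiE 'permission denied'; then
    echo "permissions"
  else
    echo "unknown"           # no shortcut: read the log by hand
  fi
}

# Usage (path taken from the job's std_err_file attribute):
#   classify_error /logs/autosys/failed_job.err
```

For the sample log shown above, this would report `dependency_down`, which tells you a restart is pointless until the database accepts connections again.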

Step 3: Fix and restart

After fixing the root cause, restart the job correctly.

restart.sh · BASH
# Option A: RESTART (for failed jobs — clean retry)
sendevent -E RESTART -J failed_job

# Option B: FORCE_STARTJOB (bypasses conditions — use when needed)
sendevent -E FORCE_STARTJOB -J failed_job

# Option C: If a BOX failed — restart the box and all its inner jobs
# First, reset all inner FAILURE jobs:
sendevent -E CHANGE_STATUS -J inner_job1 -s INACTIVE
sendevent -E CHANGE_STATUS -J inner_job2 -s INACTIVE
# Then restart the box:
sendevent -E FORCE_STARTJOB -J eod_processing_box

# Monitor the restart
watch -n 10 'autorep -J failed_job'
▶ Output
03:41:00 - failed_job: STARTING
03:41:02 - failed_job: RUNNING
03:49:33 - failed_job: SUCCESS
03:49:34 - downstream_job: STARTING
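
If you would rather not sit in front of `watch`, a polling loop can block until the job reaches a terminal state. This is a minimal sketch, assuming `autostatus -J <job>` prints the status as a single word such as SUCCESS or FAILURE; `wait_for_job` and its default timeout are hypothetical choices, not AutoSys features.

```shell
#!/usr/bin/env bash
# wait_for_job: poll a job until it reaches a terminal state or a timeout expires.
# Arguments: job name, timeout in seconds (default 1800), poll interval (default 10).
# Returns 0 on SUCCESS, 1 on FAILURE or TERMINATED, 2 on timeout.
# Assumption: autostatus prints the status as one word.
wait_for_job() {
  local job=$1 timeout=${2:-1800} interval=${3:-10}
  local waited=0 status=""
  while [ "$waited" -le "$timeout" ]; do
    status=$(autostatus -J "$job" | tr -d '[:space:]')
    case "$status" in
      SUCCESS)            echo "$job: SUCCESS"; return 0 ;;
      FAILURE|TERMINATED) echo "$job: $status"; return 1 ;;
    esac
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "$job: timed out after ${timeout}s (last status: ${status:-unknown})"
  return 2
}

# Usage (not run here):
#   wait_for_job failed_job 3600 15 && echo "safe to start verification"
```

The distinct return codes let a wrapper script tell "failed again" apart from "still running", which call for different follow-up actions.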

Step 4: Verify recovery

After restart succeeds, verify the downstream chain completed and the actual data/output is valid.

verify_recovery.sh · BASH
# Check all downstream jobs in the chain
autorep -J eod_%

# Confirm the box moved to SUCCESS
autorep -J eod_processing_box

# Check the actual output — did the script produce the expected file?
ls -la /data/output/daily_report_$(date +%Y%m%d).csv
wc -l /data/output/daily_report_$(date +%Y%m%d).csv  # check row count
▶ Output
Job Name ST Exit
eod_extract SU 0
eod_transform SU 0
eod_load SU 0
eod_report SU 0
eod_processing_box SU --

54823 /data/output/daily_report_20260319.csv
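
The file checks above can be wrapped in a small helper so the same test runs after every recovery. This is a sketch under stated assumptions: `verify_output` is a hypothetical function, and the minimum line count is a threshold you choose from what a normal run produces. It checks only presence and size, not whether the contents are right.

```shell
#!/usr/bin/env bash
# verify_output: check that an output file exists, is non-empty,
# and has at least a minimum number of lines.
# Arguments: file path, minimum line count (default 1).
verify_output() {
  local file=$1 min_lines=${2:-1}
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty"
    return 1
  fi
  local lines
  lines=$(wc -l < "$file")
  lines=$((lines))   # strip the padding some wc implementations add
  if [ "$lines" -lt "$min_lines" ]; then
    echo "FAIL: $file has $lines lines, expected at least $min_lines"
    return 1
  fi
  echo "OK: $file has $lines lines"
}

# Usage (threshold of 50000 is an example value):
#   verify_output "/data/output/daily_report_$(date +%Y%m%d).csv" 50000
```

A row count far below the usual figure is a common sign that the job exited 0 after processing only part of its input.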

Scenario | Recovery action | Key consideration
Single job FAILURE | Fix cause → RESTART or FORCE_STARTJOB | Read error log first
BOX with one inner job failed | Fix cause → RESTART inner job | Check if other inner jobs are now blocked
BOX in FAILURE with all inner jobs failed | Fix cause → FORCE_STARTJOB the box | May need to reset inner jobs to INACTIVE first
PEND_MACH | Fix the agent machine → jobs auto-recover | Don't FORCE_STARTJOB while agent is down
TERMINATED by term_run_time | Increase term_run_time → RESTART | Investigate why job ran longer than expected

🎯 Key Takeaways

  • Always read the error log before restarting: restarting into a still-broken environment wastes time
  • For single job failures: fix cause → RESTART; for box failures: may need to reset inner jobs to INACTIVE first
  • After recovery, verify actual output — SUCCESS status means exit code 0, not that the data is correct
  • Document what went wrong and the fix in the incident log — recurring failures need root cause analysis

⚠ Common Mistakes to Avoid

  • Restarting without reading the error log — results in the same failure with additional delay
  • Restarting a job inside a box without considering the box's state — if the box is in FAILURE, inner job restarts may not propagate correctly
  • Not verifying the actual output after recovery — AutoSys reporting SUCCESS doesn't mean your data is correct
  • Restarting the box when only one inner job failed — often better to restart only the failed inner job rather than the entire box

Interview Questions on This Topic

  • Q: Walk me through how you would handle a failed AutoSys job in production.
  • Q: What do you check before restarting a failed AutoSys job?
  • Q: If a BOX is in FAILURE because one inner job failed, how do you recover?
  • Q: What is the difference between restarting a job vs restarting its parent box?
  • Q: After restarting a failed job, how do you verify the recovery was successful?

Frequently Asked Questions

What should I do when an AutoSys job fails?

First, read the error log (std_err_file) to understand why it failed. Fix the underlying issue. Then restart with RESTART or FORCE_STARTJOB. Monitor until SUCCESS. Finally, verify the actual output is correct — not just that AutoSys reports success.

How do I restart an AutoSys BOX that failed?

For a BOX failure, you usually need to reset failed inner jobs to INACTIVE first (CHANGE_STATUS -s INACTIVE), then FORCE_STARTJOB the box. Alternatively, restart only the failed inner jobs while leaving succeeded ones in SUCCESS — the box will recheck and proceed.
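
The reset-then-force-start sequence can be sketched as a function with a dry-run switch, so you can review the exact commands before sending events to production. `restart_box`, `run_cmd`, and the `DRY_RUN` variable are hypothetical names; the `sendevent` invocations mirror the ones used earlier in this article.

```shell
#!/usr/bin/env bash
# run_cmd: execute a command, or just print it when DRY_RUN=1.
run_cmd() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# restart_box: reset the named inner jobs to INACTIVE, then force-start the box.
# Usage: restart_box <box_name> <inner_job> [<inner_job> ...]
restart_box() {
  local box=$1
  shift
  local job
  for job in "$@"; do
    # Stop at the first failed sendevent rather than starting a half-reset box
    run_cmd sendevent -E CHANGE_STATUS -s INACTIVE -J "$job" || return 1
  done
  run_cmd sendevent -E FORCE_STARTJOB -J "$box"
}

# Dry run first, then the real thing (not run here):
#   DRY_RUN=1 restart_box eod_box inner_job1 inner_job2
#   restart_box eod_box inner_job1 inner_job2
```

Listing the inner jobs explicitly, instead of resetting everything in the box, keeps jobs that already succeeded from being touched by accident.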

My AutoSys job keeps failing at the same step — what should I do?

A recurring failure at the same step usually indicates a root cause in the script, its configuration, or an external dependency. Check the error log in detail, add more logging to the script, and engage the application team. Restarting repeatedly without fixing the root cause is a treadmill.

After restarting, how long should I monitor before declaring recovery complete?

Monitor until the full downstream job chain completes successfully — not just the restarted job. Use autorep -J box_name% to check all inner jobs. For critical workflows, also verify the actual output data.
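
One way to script the "whole chain finished" check is to scan the summary report and fail if any job is not in SUCCESS. This is a heuristic sketch: `chain_complete` is a hypothetical helper, and it assumes each job row contains a standard two-letter status code (SU, FA, TE, RU, ST, AC, IN, OH, OI, PE, RE, QU) as a separate field and that the header row starts with "Job Name".

```shell
#!/usr/bin/env bash
# chain_complete: read autorep summary text on stdin.
# Prints "<job> <status>" for every job that is not SU.
# Returns 0 only if at least one job row was seen and all of them were SU.
chain_complete() {
  awk '
    $1 == "Job" && $2 == "Name" { next }   # skip the header row
    {
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^(SU|FA|TE|RU|ST|AC|IN|OH|OI|PE|RE|QU)$/) {
          seen++
          if ($i != "SU") { bad++; print $1, $i }
          break
        }
      }
    }
    END { exit ((seen == 0 || bad > 0) ? 1 : 0) }
  '
}

# Usage (not run here):
#   autorep -J eod_% | chain_complete && echo "chain complete"
```

Treating "no job rows found" as a failure guards against a mistyped job name pattern being reported as a clean recovery.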

How do I stop downstream jobs from running after a manual fix?

If you've fixed data manually outside AutoSys and don't want downstream jobs running again, either use CHANGE_STATUS to mark jobs SUCCESS (to signal done) or put downstream jobs ON_HOLD before the upstream restart, then release them when appropriate.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
