AutoSys Job Failure Handling and Restart Procedures

📅 March 19, 2026 ⏱ 3 min read 🎯 Advanced

Step-by-step AutoSys failure handling: how to investigate a failed job, read error logs, restart correctly, handle box failures, and run post-recovery verification.

🔥 Advanced — solid DevOps foundation required

In this tutorial, you'll learn

Always read the error log before restarting: restarting into a still-broken environment wastes time
For single job failures: fix cause → RESTART; for box failures: may need to reset inner jobs to INACTIVE first
After recovery, verify actual output — SUCCESS status means exit code 0, not that the data is correct

⚡Quick Answer

Acute triage: check scope (count the FA rows in autorep -J %) before touching any job
Always read std_err/std_out logs before restart — 90% of failures are environment, not script
RESTART is clean; FORCE_STARTJOB bypasses conditions — use with intent
Box failures require resetting inner jobs to INACTIVE before forced restart
Verify downstream chain completes — SUCCESS != correct data
Recurring failures at same step = root cause analysis, not another restart

🚨 START HERE

Quick Debug Cheat Sheet

For the 3 AM pager — commands to understand scope and act.

🟡

Paged for job failure

Immediate Action: Count failed jobs

Commands

autorep -J % | grep -cw FA

autorep -J % | grep -w FA | head -20

Fix Now: If single job: read its error log. If many: check machine availability.

🟡

Job failed, need error log

Immediate Action: Get error log path

Commands

autorep -J job -q | grep std_err

tail -100 /path/to/err_file

Fix Now: Identify root cause: connection issue, script bug, resource shortage.

🟡

Box is FAILURE, need to recover

Immediate Action: List inner job statuses

Commands

autorep -J box_name%

autorep -J inner_failed -d

Fix Now: Reset failed inner jobs to INACTIVE, then FORCE_STARTJOB the box.

🟡

Verifying recovery after restart

Immediate Action: Check restarted job status

Commands

autorep -J job

autorep -J job_name%

Fix Now: If SUCCESS, verify actual output file/row count. If still FAILURE, re-diagnose.

Production Incident

The Silent Restart That Cost a Trading Window

An operator restarted a failed job without reading the error log. The database was down, and the job failed again — this time taking down a downstream process that wasn't in the original chain.

Symptom: After a failed ETL job, the operator issued RESTART immediately. The job failed again with the same error. A downstream job that had already started (because the box status was SUCCESS from a previous run) then crashed because it received incomplete data.

Assumption: The failure was a transient script error, so a restart would fix it.

Root cause: The source database was unreachable due to a network switch failure. The blind restart let a downstream job process stale data, corrupting the daily report.

Fix: Read the error log first: autorep -J job -q | grep std_err to find the path, then tail the file. It showed connection refused. Fixed the network path, then issued RESTART once the database was reachable. Put the affected downstream jobs on ice (sendevent -E JOB_ON_ICE -J job) before restarting.

Key Lesson

Always read the error log before any restart; the root cause is rarely the script.
Know the downstream dependency chain and protect it from partial data.
A forced restart without diagnosis is gambling with production.

Production Debug Guide

Symptom → Action for common failure scenarios

Job shows FAILURE but no error log found→Check std_err_file path in job definition. Also check if the script redirected output elsewhere. Use autorep -J job -q to see full attributes.

Box in FAILURE, but inner jobs all show SUCCESS→The box failed due to a condition mismatch or a dependency failure outside the box. Check autorep -J box -d for exit code and look for NRI or TERM status on box.

Job stuck in STARTING or RUNNING for too long→Check agent machine availability (autorep -M ALL). If the process is genuinely hung, kill it with sendevent -E KILLJOB -J job, then restart it once the cause is fixed. Also check term_run_time, which terminates a job automatically after the configured number of minutes.

PEND_MACH with no agent issues→Check global variables (autorep -G ALL), job dependency conditions (autorep -J job -q | grep condition), and run_window. Use sendevent -E FORCE_STARTJOB only as a last resort.

Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.

Step 1: Triage — identify the failure

When paged, your first job is to understand the scope before touching anything. Don't jump into the first failed job — check how many others are down. A single failure is a script or dependency issue. A cascade points to an environment problem — machine, database, network, or upstream job chain.

triage.sh · BASH

# 1a. How many jobs are failing? (count rows whose status column is FA)
autorep -J % | grep -cw FA

# 1b. Are jobs failing on a specific machine?
autorep -J % | grep -w FA | awk '{print $1}' | while read -r j; do
  autorep -J "$j" -q | grep 'machine:'   # -q prints the job definition
done

# 1c. Is the agent machine available?
autorep -M ALL  # check all machines: any offline or missing?

# 1d. Are we looking at PEND_MACH instead of FAILURE?
autorep -J % | grep -cw PE

📊 Production Insight

A cascade of failures across unrelated jobs means the problem is environmental.

Single job failure is usually script or dependency — check its error log first.

Never restart a job without knowing the scope — you may restart it into a dying machine.

🎯 Key Takeaway

Triage scope before touching any job.

One failure = script/input problem; many failures = environment.

Check machine and agent status before restarting.
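
The "one failure versus many" rule can be sketched as a small filter over the summary report. This is a minimal sketch under stated assumptions: classify_scope is a hypothetical helper name, the report is piped in on stdin, and each job row is assumed to carry its status as a standalone two-letter token such as FA.

```shell
# Sketch: classify failure scope from an AutoSys summary report on stdin.
# Assumption: each job row carries its status as a standalone token such as FA.
# classify_scope is a hypothetical helper name, not an AutoSys command.
classify_scope() (
  # Count rows containing the FA token; "|| true" keeps a zero count from
  # being treated as an error by callers that run with "set -e".
  failed=$(grep -cw 'FA' || true)
  if [ "$failed" -eq 0 ]; then
    echo "none: no failed jobs in report"
  elif [ "$failed" -eq 1 ]; then
    echo "single: read that job's error log first"
  else
    echo "cascade: $failed failed jobs, check machine, database, network"
  fi
)
```

A possible invocation is autorep -J % | classify_scope. Token matching keeps the sketch short, but it would miscount if a job name itself contained FA as a separate word.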

Step 2: Diagnose — read the error log

Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why. AutoSys captures standard error and standard output. Always check both — sometimes the script succeeds but reports errors to stdout that don't affect exit code.

diagnose.sh · BASH

# 2a. Get the log paths from the job definition
autorep -J failed_job -q | grep -E 'std_err_file|std_out_file'

# 2b. Read the error log
tail -100 /logs/autosys/failed_job.err

# 2c. Read the AutoSys run history for this job
autorep -J failed_job -r -1    # previous run
autorep -J failed_job -d       # detailed current state, including exit code

# 2d. Check the current status
autostatus -J failed_job

▶ Output

std_err_file: /logs/autosys/failed_job.err

Contents of failed_job.err:
[2026-03-19 22:02:15] ERROR: Cannot connect to database DB_PROD
[2026-03-19 22:02:15] Connection refused: prod-db-01:5432 (timeout after 30s)
[2026-03-19 22:02:15] Script exiting with code 1

⚠ Always read the error log BEFORE restarting

If the database is down, restarting the job gets you nothing. Fix the underlying issue first. Reading the error log takes 30 seconds; diagnosing why a restart failed again takes much longer.

📊 Production Insight

In production, error logs are your first source of truth — don't rely on exit codes alone.

Scripts often exit 0 even when they produce incorrect output.

Check both std_err and std_out — the real problem may be hidden in the output log.

🎯 Key Takeaway

Always read the error log before restart.

Check both std_err and std_out.

The error log is free — ignoring it costs time and pager stress.
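
Reading the log usually ends with sorting the failure into a bucket: connection issue, resource shortage, missing input, or script bug. A minimal sketch of that sorting step follows. classify_error_log is a hypothetical helper, and the grep patterns are illustrative assumptions that would need tuning to the messages your own scripts emit.

```shell
# Sketch: bucket an error log into a likely failure category.
# The patterns are illustrative assumptions, not an exhaustive list.
# Usage: classify_error_log LOG_FILE
classify_error_log() (
  log_file=$1
  if [ ! -s "$log_file" ]; then
    echo "unknown: log missing or empty, check std_out and the job definition"
  elif grep -Eqi 'connection refused|cannot connect|timed? ?out' "$log_file"; then
    echo "environment: connectivity problem, fix it before restarting"
  elif grep -Eqi 'no space left|out of memory|cannot allocate' "$log_file"; then
    echo "resource: free disk or memory before restarting"
  elif grep -Eqi 'no such file|permission denied' "$log_file"; then
    echo "input: missing or unreadable file"
  else
    echo "script: no known environment pattern, review the script error"
  fi
)
```

Run against the sample failed_job.err shown earlier, the first pattern group matches the connection error, so the job would be classified as an environment failure rather than a script bug.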

Step 3: Fix and restart correctly

After fixing the root cause, restart the job using the right command. RESTART is the safe option — it clears the failure status and restarts the job with the same conditions (dependencies, machine requirements). FORCE_STARTJOB bypasses all conditions — use it when you need to start a job that is still blocked by unmet dependencies (e.g., after manually fixing data). For box failures, never restart the box without first resetting the failed inner jobs — otherwise the box will immediately re-fail.

restart.sh · BASH

# Option A: RESTART (for failed jobs — clean retry)
sendevent -E RESTART -J failed_job

# Option B: FORCE_STARTJOB (bypasses conditions — use when needed)
sendevent -E FORCE_STARTJOB -J failed_job

# Option C: If a BOX failed — restart the box and all its inner jobs
# First, reset all inner FAILURE jobs:
sendevent -E CHANGE_STATUS -J inner_job1 -s INACTIVE
sendevent -E CHANGE_STATUS -J inner_job2 -s INACTIVE
# Then restart the box:
sendevent -E FORCE_STARTJOB -J eod_box

# Monitor the restart
watch -n 10 'autorep -J failed_job'

▶ Output

03:41:00 — failed_job: STARTING
03:41:02 — failed_job: RUNNING
03:49:33 — failed_job: SUCCESS
03:49:34 — downstream_job: STARTING

📊 Production Insight

RESTART respects job dependencies; FORCE_STARTJOB does not.

Resetting box inner jobs to INACTIVE prevents the box from re-failing immediately.

Always monitor the restart: a job that goes directly from RUNNING to FAILURE in seconds indicates an unfixed issue.

🎯 Key Takeaway

RESTART = safe retry; FORCE_STARTJOB = bypass conditions.

Boxes need inner INACTIVE reset first.

Monitor the restart — quick fail means root cause still present.
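
For box recovery, it can help to print the command sequence and review it before running anything. The sketch below only echoes commands; it never calls sendevent. box_recovery_plan is a hypothetical helper name, and the two sendevent forms it prints are the ones used in restart.sh above.

```shell
# Sketch: print (not run) the box recovery sequence described in this step.
# box_recovery_plan is a hypothetical helper; it only echoes commands for review.
# Usage: box_recovery_plan BOX_NAME INNER_JOB [INNER_JOB ...]
box_recovery_plan() (
  if [ "$#" -lt 2 ]; then
    echo "usage: box_recovery_plan BOX_NAME INNER_JOB [INNER_JOB ...]" >&2
    exit 1
  fi
  box=$1
  shift
  # Reset each failed inner job first, then force-start the box.
  for job in "$@"; do
    echo "sendevent -E CHANGE_STATUS -J $job -s INACTIVE"
  done
  echo "sendevent -E FORCE_STARTJOB -J $box"
)
```

For example, box_recovery_plan eod_box inner_job1 inner_job2 prints the three commands in order. Printing first gives a second person the chance to check job names during an incident.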

Step 4: Verify recovery — beyond the SUCCESS status

A job showing SUCCESS in AutoSys means its exit code was 0. It does not mean the data is correct or the downstream chain completed. Verification is the step that separates reliable operators from ones who get called back an hour later. Check that all downstream jobs in the chain succeeded, and then validate the actual output: file size, row count, or a simple data check.

verify_recovery.sh · BASH

# Check all downstream jobs in the chain
autorep -J eod_%

# Confirm the box moved to SUCCESS
autorep -J eod_processing_box

# Check the actual output — did the script produce the expected file?
ls -la /data/output/daily_report_$(date +%Y%m%d).csv
wc -l /data/output/daily_report_$(date +%Y%m%d).csv  # check row count

▶ Output

Job Name ST Exit
eod_extract SU 0
eod_transform SU 0
eod_load SU 0
eod_report SU 0
eod_processing_box SU --

/data/output/daily_report_20260319.csv: 54823 lines

📊 Production Insight

SUCCESS in AutoSys means exit code 0, not correct data.

A job that exits 0 with an empty file is still SUCCESS — and will corrupt downstream processes.

Always validate output for critical jobs — row count, checksum, or a sample read.

🎯 Key Takeaway

SUCCESS != correct data.

Always verify downstream chain and actual output.

An empty or corrupt file with exit code 0 is a silent production killer.
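
The output check can be made mechanical so that an empty file fails loudly instead of passing as SUCCESS. A minimal sketch, assuming a line-oriented output file; verify_output and the minimum-row threshold are illustrative and not part of AutoSys.

```shell
# Sketch: fail loudly when a "successful" job produced no usable output.
# verify_output and the minimum-row threshold are illustrative assumptions.
# Usage: verify_output FILE MIN_ROWS
verify_output() (
  file=$1
  min_rows=$2
  if [ ! -f "$file" ]; then
    echo "FAIL: $file does not exist"
    exit 1
  fi
  # Reading via redirection makes wc print only the count; tr strips padding.
  rows=$(wc -l < "$file" | tr -d ' ')
  if [ "$rows" -lt "$min_rows" ]; then
    echo "FAIL: $file has $rows rows, expected at least $min_rows"
    exit 1
  fi
  echo "OK: $file has $rows rows"
)
```

A call such as verify_output /data/output/daily_report_$(date +%Y%m%d).csv 1000 returns non-zero on a missing or short file, so it can gate a follow-up step. The threshold of 1000 is a placeholder; a realistic value comes from the job's normal output size.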

Step 5: Post-recovery actions — document and escalate

Recovery isn't complete until you've recorded what happened and why. This step prevents recurring failures and helps your team learn. Write a brief incident summary: what failed, root cause, fix applied, time to recover, and any follow-up needed (e.g., increase term_run_time, add a healthcheck to the database monitoring). If the failure indicates a systemic problem (e.g., multiple jobs using the same broken dependency), escalate to the appropriate team. Finally, update any runbooks or monitoring rules that could have caught the issue earlier.

post_recovery.sh · BASH

# Log the incident to a shared file or alerting system
echo "$(date): Job eod_extract failed: DB_PROD unreachable" >> /var/log/autosys_incidents.log
echo "Root cause: Network switch failure. Fixed by ops team." >> /var/log/autosys_incidents.log
echo "Recovery time: 12 min" >> /var/log/autosys_incidents.log

# If needed, raise the job's term_run_time through JIL.
# term_run_time is a job attribute measured in minutes; 20 is an example value.
jil <<'EOF'
update_job: eod_extract
term_run_time: 20
EOF

# Notify stakeholders
mail -s "AutoSys incident report: eod_extract" ops-team@company.com < /var/log/autosys_incidents.log

🔥Why post-recovery matters

Every incident is a chance to improve your system. By documenting root cause and resolution, you build a knowledge base that reduces MTTR for future failures. Escalate systemic issues to prevent recurrence across the entire batch schedule.

📊 Production Insight

Undocumented failures are repeat failures.

If you don't record the root cause, the next person will restart into the same broken environment.

Post-recovery documentation is the foundation of operational maturity.

🎯 Key Takeaway

Always document the incident.

Escalate systemic issues.

Post-recovery is when you prevent the next failure.
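
An incident summary is easier to search later if every entry has the same fields. The sketch below appends one record per incident. log_incident and the pipe-separated layout are assumptions for illustration; any consistent format your team already uses serves the same purpose.

```shell
# Sketch: append one structured incident record to a log file.
# log_incident and the field layout are illustrative assumptions.
# Usage: log_incident LOG_FILE JOB ROOT_CAUSE FIX RECOVERY_MINUTES
log_incident() (
  log_file=$1
  job=$2
  cause=$3
  fix=$4
  minutes=$5
  # One line per incident keeps the log easy to grep by job or cause.
  printf '%s | job=%s | cause=%s | fix=%s | recovery_min=%s\n' \
    "$(date '+%Y-%m-%d %H:%M:%S')" "$job" "$cause" "$fix" "$minutes" >> "$log_file"
)
```

For instance, log_incident /var/log/autosys_incidents.log eod_extract "DB_PROD unreachable" "network path restored" 12 records the incident from this article, and grep 'job=eod_extract' on the log then shows whether the same job keeps failing for the same reason.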

🗂 Failure recovery decision table

Scenario	Recovery action	Key consideration
Single job FAILURE	Fix cause → RESTART or FORCE_STARTJOB	Read error log first
BOX with one inner job failed	Fix cause → RESTART inner job	Check if other inner jobs are now blocked
BOX in FAILURE with all inner jobs failed	Fix cause → FORCE_STARTJOB the box	May need to reset inner jobs to INACTIVE first
PEND_MACH	Fix the agent machine → jobs auto-recover	Don't FORCE_STARTJOB while agent is down
TERMINATED by term_run_time	Increase term_run_time → RESTART	Investigate why job ran longer than expected
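
The table can also be kept as a lookup inside a runbook script. In this sketch the scenario keys (job_failure, box_one_failed, and so on) are hypothetical labels, and the action text is taken from the table rows above.

```shell
# Sketch: the decision table as a lookup.
# Scenario keys are hypothetical labels; actions come from the table rows.
# Usage: recovery_action SCENARIO
recovery_action() (
  case "$1" in
    job_failure)    echo "Read error log, fix cause, then RESTART or FORCE_STARTJOB" ;;
    box_one_failed) echo "Fix cause, then RESTART the failed inner job" ;;
    box_all_failed) echo "Fix cause, reset inner jobs to INACTIVE if needed, then FORCE_STARTJOB the box" ;;
    pend_mach)      echo "Fix the agent machine and let jobs auto-recover" ;;
    terminated)     echo "Investigate the long run, increase term_run_time, then RESTART" ;;
    *)              echo "Unknown scenario: $1" >&2; exit 1 ;;
  esac
)
```

An unknown key returns a non-zero status instead of guessing, which matches the rule that recovery actions should follow a diagnosis.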

🎯 Key Takeaways

Always read the error log before restarting: restarting into a still-broken environment wastes time
For single job failures: fix cause → RESTART; for box failures: may need to reset inner jobs to INACTIVE first
After recovery, verify actual output — SUCCESS status means exit code 0, not that the data is correct
Document what went wrong and the fix in the incident log — recurring failures need root cause analysis

⚠ Common Mistakes to Avoid

✕Restarting without reading the error log

Symptom

Job fails again with same error; additional delays and pager fatigue.

Fix

Always run autorep -J job -q to find the error log path (std_err_file), then tail the log before any restart.

✕Restarting a job inside a box without considering the box's state

Symptom

Inner job restarts but box remains FAILURE; downstream jobs not triggered.

Fix

Check box status with autorep -J box -d. If box is FAILURE, reset inner jobs to INACTIVE and restart the box, not just the inner job.

✕Not verifying actual output after recovery

Symptom

AutoSys reports SUCCESS but data is missing or corrupt; downstream failures happen later.

Fix

After restart, check actual output file size/row count and run a sample validation before declaring success.

✕Restarting the entire box when only one inner job failed

Symptom

Other inner jobs that completed successfully restart unnecessarily, extending recovery time and risking data duplication.

Fix

Restart only the failed inner job. If the box status allows it, the box will re-evaluate and succeed once that inner job completes.

Interview Questions on This Topic

Q: Walk me through how you would handle a failed AutoSys job in production. (Mid-level)
First, triage scope: count the failed jobs (autorep -J % | grep -cw FA). If many, investigate the environment: machine, database, network. If single, get the std_err_file path from autorep -J job -q, then tail the log. Fix the root cause (e.g., restart the database, correct the script input). Then issue RESTART for a clean retry, or FORCE_STARTJOB if conditions are blocking. Monitor the restart with watch -n 10 'autorep -J job'. After success, verify the downstream chain and the actual output. Finally, document the incident.
Q: What do you check before restarting a failed AutoSys job? (Junior)
First, check the error log (std_err_file) to understand why it failed. Second, check the machine status (autorep -M ALL) to ensure the agent is up. Third, check whether the job's dependencies are satisfied (autorep -J job -q | grep condition). Fourth, assess the box status: if the job is inside a box, the box must not be in FAILURE itself (otherwise restarting only the inner job won't propagate). Finally, if the failure was due to a timeout, consider increasing term_run_time before the restart.
Q: If a BOX is in FAILURE because one inner job failed, how do you recover? (Senior)
Option A: Reset the failed inner job to INACTIVE (sendevent -E CHANGE_STATUS -J inner_job -s INACTIVE), then restart the inner job with RESTART or FORCE_STARTJOB. The box will re-evaluate and can succeed. Option B: If the box's starting conditions are complex, reset all failed inner jobs to INACTIVE and then FORCE_STARTJOB the box. Never restart the box without handling the inner failures: the box will immediately go to FAILURE again because the inner job is still in FAILURE.
Q: What is the difference between restarting a job vs restarting its parent box? (Senior)
Restarting a job (sendevent -E RESTART -J job) clears its failure status, respects its dependencies, and runs it again individually. Force-starting a box (sendevent -E FORCE_STARTJOB -J box) starts the box regardless of its own starting conditions and re-runs its inner jobs as the box activates them, which can include jobs that already succeeded. Choosing between them depends on whether you want to re-run only the failed component (faster, lower risk) or need to re-execute the entire workflow (when other inner jobs also need refreshing or when the box status is inconsistent).
Q: After restarting a failed job, how do you verify the recovery was successful? (Mid-level)
Check the job status (autorep -J job) for SUCCESS. Then check all downstream jobs that depend on it, using autorep with a wildcard pattern. Verify the actual output data: file size, row count, or content sample. Finally, confirm the box status if applicable (autorep -J box). A job may exit with code 0 but produce an empty file; that's why data validation is essential.

Frequently Asked Questions

What should I do when an AutoSys job fails?

First, read the error log (std_err_file) to understand why it failed. Fix the underlying issue. Then restart with RESTART or FORCE_STARTJOB. Monitor until SUCCESS. Finally, verify the actual output is correct — not just that AutoSys reports success.

How do I restart an AutoSys BOX that failed?

For a BOX failure, you usually need to reset failed inner jobs to INACTIVE first (CHANGE_STATUS -s INACTIVE), then FORCE_STARTJOB the box. Alternatively, restart only the failed inner jobs while leaving succeeded ones in SUCCESS — the box will recheck and proceed.

My AutoSys job keeps failing at the same step — what should I do?

A recurring failure at the same step usually indicates a root cause in the script, its configuration, or an external dependency. Check the error log in detail, add more logging to the script, and engage the application team. Restarting repeatedly without fixing the root cause is a treadmill.

After restarting, how long should I monitor before declaring recovery complete?

Monitor until the full downstream job chain completes successfully — not just the restarted job. Use autorep -J box_name% to check all inner jobs. For critical workflows, also verify the actual output data.
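
"The full chain completed" can be checked mechanically instead of by eye. The sketch below reads a summary report on stdin and names every job under a prefix that has not reached SU. chain_complete is a hypothetical helper, and it assumes job rows start with the job name and contain the status as a standalone token.

```shell
# Sketch: report whether every job under a name prefix has reached SU.
# Assumption: each job row starts with the job name and contains its status
# as a standalone token. chain_complete is a hypothetical helper name.
# Usage: autorep -J 'eod_%' | chain_complete eod_
chain_complete() (
  prefix=$1
  report=$(cat)
  # Names of matching jobs whose row has no SU token.
  pending=$(printf '%s\n' "$report" | awk -v p="$prefix" '
    index($1, p) == 1 {
      ok = 0
      for (i = 2; i <= NF; i++) if ($i == "SU") ok = 1
      if (!ok) print $1
    }')
  total=$(printf '%s\n' "$report" | awk -v p="$prefix" 'index($1, p) == 1 { n++ } END { print n + 0 }')
  if [ "$total" -eq 0 ]; then
    echo "NO JOBS MATCHED: $prefix"
    exit 1
  fi
  if [ -n "$pending" ]; then
    # Unquoted on purpose: joins the names onto one line.
    echo "NOT COMPLETE:" $pending
    exit 1
  fi
  echo "COMPLETE: $total jobs in SU"
)
```

Filtering on the prefix skips header lines, and the non-zero exit status lets a polling loop keep waiting until the chain is done.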

How do I stop downstream jobs from running after a manual fix?

If you've fixed data manually outside AutoSys and don't want downstream jobs running again, either use CHANGE_STATUS to mark jobs SUCCESS (to signal done) or put downstream jobs on hold (sendevent -E JOB_ON_HOLD -J job) before the upstream restart, then release them (sendevent -E JOB_OFF_HOLD -J job) when appropriate.

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
