Mid-level 3 min · March 19, 2026

AutoSys Restart Failures — Diagnosis Before Recovery

Blind restart after ETL failure caused downstream job to process stale data.

Naren · Founder
Quick Answer
  • Acute triage: check scope (autorep -J % | grep ' FA ') before touching any job
  • Always read std_err/std_out logs before restart: many failures are environmental, not script bugs
  • RESTART is clean; FORCE_STARTJOB bypasses conditions — use with intent
  • Box failures require resetting inner jobs to INACTIVE before forced restart
  • Verify downstream chain completes — SUCCESS != correct data
  • Recurring failures at same step = root cause analysis, not another restart
Plain-English First

When an AutoSys job fails at 3 AM, you need a clear playbook: find out why it failed, fix the issue, restart correctly, and verify recovery. This article is that playbook.

Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.

Step 1: Triage — identify the failure

When paged, your first job is to understand the scope before touching anything. Don't jump into the first failed job — check how many others are down. A single failure is a script or dependency issue. A cascade points to an environment problem — machine, database, network, or upstream job chain.

triage.sh (Bash)
# 1a. How many jobs are failing?
# autorep has no status filter (-s selects the summary report), so filter with grep.
autorep -J % | grep -c ' FA '

# 1b. Are jobs failing on a specific machine?
autorep -J % | grep ' FA ' | awk '{print $1}' | while read -r j; do
  autorep -J "$j" -q | grep '^machine:'
done

# 1c. Is the agent machine available?
autorep -M %  # check all machines: any Missing or Offline?

# 1d. Are we looking at PEND_MACH instead of FAILURE?
autorep -J % | grep -c ' PE '
Production Insight
A cascade of failures across unrelated jobs means the problem is environmental.
Single job failure is usually script or dependency — check its error log first.
Never restart a job without knowing the scope — you may restart it into a dying machine.
Key Takeaway
Triage scope before touching any job.
One failure = script/input problem; many failures = environment.
Check machine and agent status before restarting.
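
The one-versus-many rule above can be sketched as a small shell function. This is a minimal sketch, not an AutoSys feature: the function name classify_scope and the cascade threshold of 3 are illustrative assumptions, and it assumes the autorep summary report prints the two-letter status as its own whitespace-separated column.

```shell
#!/bin/sh
# classify_scope: read autorep summary output on stdin and report whether the
# failure looks isolated or environmental.
# Assumption: the status code (FA, SU, RU, ...) is its own whitespace-separated field.
# The cascade threshold (default 3) is an arbitrary illustrative value.
classify_scope() {
  threshold="${1:-3}"
  # Count lines that carry FA in any field after the job name.
  failed=$(awk '{ for (i = 2; i <= NF; i++) if ($i == "FA") { n++; break } } END { print n + 0 }')
  if [ "$failed" -eq 0 ]; then
    echo "none: no jobs in FAILURE"
  elif [ "$failed" -lt "$threshold" ]; then
    echo "isolated: $failed failed, read the error log first"
  else
    echo "cascade: $failed failed, check machines, database, network"
  fi
}
```

A typical call would be: autorep -J % | classify_scope. Pass a different threshold as the first argument if your schedule normally carries a few known failures.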
Figure: Failure Recovery Flow, step by step from alert to verified recovery. Alert fires (alarm_if_fail) → Check scope (list jobs in FA status) → Read error log (cat std_err_file) → Fix root cause (code/infra fix) → RESTART job (sendevent RESTART) → Verify output (check data)

Step 2: Diagnose — read the error log

Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why. AutoSys captures standard error and standard output. Always check both — sometimes the script succeeds but reports errors to stdout that don't affect exit code.

diagnose.sh (Bash)
# 2a. Get the log paths from the job definition (-q prints the JIL attributes)
autorep -J failed_job -q | grep -E 'std_err_file|std_out_file'

# 2b. Read the error log
tail -n 100 /logs/autosys/failed_job.err

# 2c. Review this job's run history
autorep -J failed_job -r -1    # previous run
autorep -J failed_job -d       # detailed report for the latest run

# 2d. Check the current status (the exit code is in the Pri/Xit column of the autorep report)
autostatus -J failed_job
Output
std_err_file: /logs/autosys/failed_job.err
Contents of failed_job.err:
[2026-03-19 22:02:15] ERROR: Cannot connect to database DB_PROD
[2026-03-19 22:02:15] Connection refused: prod-db-01:5432 (timeout after 30s)
[2026-03-19 22:02:15] Script exiting with code 1
Always read the error log BEFORE restarting
If the database is down, restarting the job gets you nothing. Fix the underlying issue first. Reading the error log takes 30 seconds; diagnosing why a restart failed again takes much longer.
Production Insight
In production, error logs are your first source of truth — don't rely on exit codes alone.
Scripts often exit 0 even when they produce incorrect output.
Check both std_err and std_out — the real problem may be hidden in the output log.
Key Takeaway
Always read the error log before restart.
Check both std_err and std_out.
The error log is free — ignoring it costs time and pager stress.
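
A first pass over the error log can be scripted. The sketch below is one possible approach, assuming the log is a plain text file; the function name classify_error_log and the pattern list are illustrative and far from exhaustive, so treat the result as a hint, not a diagnosis.

```shell
#!/bin/sh
# classify_error_log: scan the last 100 lines of a log file and guess whether the
# failure is environmental (connectivity, disk, permissions) or needs a closer
# look at the script and its input.
# The patterns are illustrative examples, not an official or complete list.
classify_error_log() {
  log_file="$1"
  if [ ! -r "$log_file" ]; then
    echo "unknown: cannot read $log_file"
    return 2
  fi
  if tail -n 100 "$log_file" |
     grep -Eiq 'connection refused|cannot connect|timed? ?out|no space left|permission denied'; then
    echo "environment: fix the dependency before restarting"
  else
    echo "script-or-data: inspect the script and its input"
  fi
}
```

Run against the sample log above, classify_error_log /logs/autosys/failed_job.err would report an environment problem because of the "Cannot connect" and "Connection refused" lines.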

Step 3: Fix and restart correctly

After fixing the root cause, restart the job using the right command. RESTART is the safe option — it clears the failure status and restarts the job with the same conditions (dependencies, machine requirements). FORCE_STARTJOB bypasses all conditions — use it when you need to start a job that is still blocked by unmet dependencies (e.g., after manually fixing data). For box failures, never restart the box without first resetting the failed inner jobs — otherwise the box will immediately re-fail.

restart.sh (Bash)
# Option A: RESTART (for failed jobs — clean retry)
sendevent -E RESTART -J failed_job

# Option B: FORCE_STARTJOB (bypasses conditions — use when needed)
sendevent -E FORCE_STARTJOB -J failed_job

# Option C: If a BOX failed — restart the box and all its inner jobs
# First, reset all inner FAILURE jobs:
sendevent -E CHANGE_STATUS -J inner_job1 -s INACTIVE
sendevent -E CHANGE_STATUS -J inner_job2 -s INACTIVE
# Then restart the box:
sendevent -E FORCE_STARTJOB -J eod_processing_box

# Monitor the restart
watch -n 10 'autorep -J failed_job'
Output
03:41:00 - failed_job: STARTING
03:41:02 - failed_job: RUNNING
03:49:33 - failed_job: SUCCESS
03:49:34 - downstream_job: STARTING
Production Insight
RESTART respects job dependencies; FORCE_STARTJOB does not.
Resetting box inner jobs to INACTIVE prevents the box from re-failing immediately.
Always monitor the restart: a job that goes directly from RUNNING to FAILURE in seconds indicates an unfixed issue.
Key Takeaway
RESTART = safe retry; FORCE_STARTJOB = bypass conditions.
Boxes need inner INACTIVE reset first.
Monitor the restart — quick fail means root cause still present.
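
For a box with several failed inner jobs, typing the reset commands by hand invites mistakes. The sketch below generates the command list as a dry run: it prints the sendevent lines and executes nothing. The function name box_recovery_plan is an assumption, and it relies on the autorep summary report showing the status as its own whitespace-separated column.

```shell
#!/bin/sh
# box_recovery_plan: print (do not run) the sendevent commands for a box recovery.
# stdin: autorep summary output for the box and its inner jobs.
# $1:    the box name, which is excluded from the INACTIVE reset.
box_recovery_plan() {
  box="$1"
  awk -v box="$box" '
    $1 != box {
      # Any inner job whose status field is FA gets reset to INACTIVE.
      for (i = 2; i <= NF; i++)
        if ($i == "FA") {
          print "sendevent -E CHANGE_STATUS -J " $1 " -s INACTIVE"
          break
        }
    }
    END { print "sendevent -E FORCE_STARTJOB -J " box }
  '
}
```

For example: autorep -J eod_processing_box | box_recovery_plan eod_processing_box. Review the printed commands, then run them yourself once the root cause is fixed.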

Step 4: Verify recovery — beyond the SUCCESS status

A job showing SUCCESS in AutoSys means its exit code was 0. It does not mean the data is correct or the downstream chain completed. Verification is the step that separates reliable operators from ones who get called back an hour later. Check that all downstream jobs in the chain succeeded, and then validate the actual output: file size, row count, or a simple data check.

verify_recovery.sh (Bash)
# Check all downstream jobs in the chain
autorep -J eod_%

# Confirm the box moved to SUCCESS
autorep -J eod_processing_box

# Check the actual output — did the script produce the expected file?
ls -la /data/output/daily_report_$(date +%Y%m%d).csv
wc -l /data/output/daily_report_$(date +%Y%m%d).csv  # check row count
Output
Job Name ST Exit
eod_extract SU 0
eod_transform SU 0
eod_load SU 0
eod_report SU 0
eod_processing_box SU --
/data/output/daily_report_20260319.csv: 54823 lines
Production Insight
SUCCESS in AutoSys means exit code 0, not correct data.
A job that exits 0 with an empty file is still SUCCESS — and will corrupt downstream processes.
Always validate output for critical jobs — row count, checksum, or a sample read.
Key Takeaway
SUCCESS != correct data.
Always verify downstream chain and actual output.
An empty or corrupt file with exit code 0 is a silent production killer.
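
The output check can be turned into a gate that fails loudly instead of relying on someone reading ls output. This is a minimal sketch: validate_output is a hypothetical helper, and the minimum row count is a per-job number you would have to choose yourself.

```shell
#!/bin/sh
# validate_output: fail unless the file exists, is non-empty, and has at least
# min_rows lines. Returns 0 on success and 1 on any failed check.
validate_output() {
  file="$1"
  min_rows="$2"
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty"
    return 1
  fi
  rows=$(wc -l < "$file" | tr -d ' ')
  if [ "$rows" -lt "$min_rows" ]; then
    echo "FAIL: $file has $rows lines, expected at least $min_rows"
    return 1
  fi
  echo "OK: $file has $rows lines"
}
```

For the report in this example, something like validate_output "/data/output/daily_report_$(date +%Y%m%d).csv" 50000 would catch the empty-file case that AutoSys reports as SUCCESS. The 50000 figure is only a placeholder.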

Step 5: Post-recovery actions — document and escalate

Recovery isn't complete until you've recorded what happened and why. This step prevents recurring failures and helps your team learn. Write a brief incident summary: what failed, root cause, fix applied, time to recover, and any follow-up needed (e.g., increase term_run_time, add a healthcheck to the database monitoring). If the failure indicates a systemic problem (e.g., multiple jobs using the same broken dependency), escalate to the appropriate team. Finally, update any runbooks or monitoring rules that could have caught the issue earlier.

post_recovery.sh (Bash)
# Log the incident to a shared file or alerting system
echo "$(date): Job eod_extract failed: DB_PROD unreachable" >> /var/log/autosys_incidents.log
echo "Root cause: Network switch failure. Fixed by ops team." >> /var/log/autosys_incidents.log
echo "Recovery time: 12 min" >> /var/log/autosys_incidents.log

# If needed, update the job's term_run_time or add a retry policy (n_retrys).
# Job attributes are changed through JIL, not sendevent; term_run_time is in minutes
# and the value here is illustrative.
jil <<'EOF'
update_job: eod_extract
term_run_time: 20
EOF

# Notify stakeholders
mail -s "AutoSys incident report: eod_extract" ops-team@company.com < /var/log/autosys_incidents.log
Why post-recovery matters
Every incident is a chance to improve your system. By documenting root cause and resolution, you build a knowledge base that reduces MTTR for future failures. Escalate systemic issues to prevent recurrence across the entire batch schedule.
Production Insight
Undocumented failures are repeat failures.
If you don't record the root cause, the next person will restart into the same broken environment.
Post-recovery documentation is the foundation of operational maturity.
Key Takeaway
Always document the incident.
Escalate systemic issues.
Post-recovery is when you prevent the next failure.
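
Free-form echo lines are hard to search later. One alternative, sketched below, is to write each incident as a single structured line so that recurring failures can be counted. The helper names log_incident and count_incidents and the pipe-separated layout are assumptions for illustration, not an AutoSys convention.

```shell
#!/bin/sh
# log_incident: append one structured line per incident.
# Arguments: log_file job root_cause fix minutes_to_recover
log_incident() {
  log_file="$1"; job="$2"; cause="$3"; fix="$4"; minutes="$5"
  printf '%s|job=%s|cause=%s|fix=%s|ttr_min=%s\n' \
    "$(date '+%Y-%m-%d %H:%M:%S')" "$job" "$cause" "$fix" "$minutes" >> "$log_file"
}

# count_incidents: how many recorded incidents mention this job?
# Arguments: log_file job
count_incidents() {
  grep -c "|job=$2|" "$1"
}
```

A count above one for the same job is the signal to stop restarting and start root cause analysis.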
Production incident · Post-mortem · Severity: high

The Silent Restart That Cost a Trading Window

Symptom
After a failed ETL job, operator issued RESTART immediately. Job failed again with same error. A downstream job that had already started (because the box status was SUCCESS from a previous run) then crashed because it received incomplete data.
Assumption
The failure was a transient script error — restart would fix it.
Root cause
The source database was unreachable because of a network switch failure, so the restart could not succeed. Meanwhile a downstream job ran against stale data and corrupted the daily report.
Fix
Read the error log first: autorep -J job -q | grep std_err_file, then tail that file. It showed connection refused. Fixed the network path, then issued RESTART once the database was reachable. Used JOB_ON_ICE to put the affected downstream jobs on ice before restarting.
Key lesson
  • Always read the error log before any restart — the root cause is rarely the script.
  • Know the downstream dependency chain and protect it from partial data.
  • A forced restart without diagnosis is gambling with production.
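
The "failed again with the same error" pattern from this incident can be detected mechanically. The sketch below compares the last ERROR line of two log files after stripping the timestamp. It assumes log lines shaped like the Step 2 sample ("[timestamp] ERROR: ..."), and the function names are illustrative.

```shell
#!/bin/sh
# last_error_signature: print the last line containing ERROR, with a leading
# "[timestamp] " prefix removed so that two runs can be compared.
last_error_signature() {
  grep 'ERROR' "$1" | tail -n 1 | sed 's/^\[[^]]*] *//'
}

# same_failure: exit 0 when both logs end on the same non-empty error signature.
# Arguments: previous_log current_log
same_failure() {
  a=$(last_error_signature "$1")
  b=$(last_error_signature "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}
```

If same_failure returns success after a restart, the root cause is still in place and another restart will not help.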
Production debug guide: Symptom → Action for common failure scenarios
Symptom · 01
Job shows FAILURE but no error log found
Fix
Check std_err_file path in job definition. Also check if the script redirected output elsewhere. Use autorep -J job -q to see full attributes.
Symptom · 02
Box in FAILURE, but inner jobs all show SUCCESS
Fix
The box failed due to a condition mismatch or a dependency failure outside the box. Check autorep -J box -d for exit code and look for NRI or TERM status on box.
Symptom · 03
Job stuck in STARTING or RUNNING for too long
Fix
Check agent machine availability (autorep -M ALL). Kill the stuck process with sendevent -E KILLJOB -J job, then restart it once the cause is fixed. Also check term_run_time.
Symptom · 04
PEND_MACH with no agent issues
Fix
Check global variables (autorep -G ALL), job dependency conditions (autorep -J job -q | grep condition), and run_window. Use sendevent -E FORCE_STARTJOB only as a last resort.
Quick Debug Cheat Sheet
For the 3 AM pager: commands to understand scope and act.
Paged for job failure
Immediate action
Count failed jobs
Commands
autorep -J % | grep -c ' FA '
autorep -J % | grep ' FA ' | head -20
Fix now
If single job: read its error log. If many: check machine availability.
Job failed, need error log
Immediate action
Get error log path
Commands
autorep -J job -q | grep std_err_file
tail -100 /path/to/err_file
Fix now
Identify root cause: connection issue, script bug, resource shortage.
Box is FAILURE, need to recover
Immediate action
List inner job statuses
Commands
autorep -J box_name%
autorep -J inner_failed -d
Fix now
Reset failed inner jobs to INACTIVE, then FORCE_STARTJOB the box.
Verifying recovery after restart
Immediate action
Check restarted job status
Commands
autorep -J job
autorep -J job_name%
Fix now
If SUCCESS, verify actual output file/row count. If still FAILURE, re-diagnose.
Failure recovery decision table
Scenario | Recovery action | Key consideration
Single job FAILURE | Fix cause → RESTART or FORCE_STARTJOB | Read error log first
BOX with one inner job failed | Fix cause → RESTART inner job | Check if other inner jobs are now blocked
BOX in FAILURE with all inner jobs failed | Fix cause → FORCE_STARTJOB the box | May need to reset inner jobs to INACTIVE first
PEND_MACH | Fix the agent machine → jobs auto-recover | Don't FORCE_STARTJOB while agent is down
TERMINATED by term_run_time | Increase term_run_time → RESTART | Investigate why job ran longer than expected
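
The table can also be carried as a lookup in a runbook script. The sketch below maps the two-letter status codes used in this guide to a one-line recommendation. The function name recommend_action is hypothetical, and only the statuses discussed here are covered.

```shell
#!/bin/sh
# recommend_action: map an autorep status code to the recovery action from the table.
# Unknown codes fall through to a neutral message instead of guessing.
recommend_action() {
  case "$1" in
    FA) echo "Read the error log, fix the cause, then restart" ;;
    PE) echo "Fix the agent machine; do not FORCE_STARTJOB while it is down" ;;
    TE) echo "Investigate the long run, raise term_run_time if justified, then restart" ;;
    SU) echo "Verify the output, not just the status" ;;
    *)  echo "No recommendation for status $1" ;;
  esac
}
```

For instance, recommend_action PE prints the PEND_MACH advice, which is a useful reminder when the instinct at 3 AM is to force-start everything.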

Key takeaways

1. Always read the error log before restarting: a restart into a still-broken environment wastes time.
2. For single job failures, fix the cause, then RESTART; for box failures, you may need to reset inner jobs to INACTIVE first.
3. After recovery, verify actual output: SUCCESS status means exit code 0, not that the data is correct.
4. Document what went wrong and the fix in the incident log: recurring failures need root cause analysis.

Common mistakes to avoid


Restarting without reading the error log

Symptom
Job fails again with same error; additional delays and pager fatigue.
Fix
Always run autorep -J job -q to find the error log path (std_err_file), then tail the log before any restart.

Restarting a job inside a box without considering the box's state

Symptom
Inner job restarts but box remains FAILURE; downstream jobs not triggered.
Fix
Check box status with autorep -J box -d. If box is FAILURE, reset inner jobs to INACTIVE and restart the box, not just the inner job.

Not verifying actual output after recovery

Symptom
AutoSys reports SUCCESS but data is missing or corrupt; downstream failures happen later.
Fix
After restart, check actual output file size/row count and run a sample validation before declaring success.

Restarting the entire box when only one inner job failed

Symptom
Other inner jobs that completed successfully restart unnecessarily, extending recovery time and risking data duplication.
Fix
Restart only the failed inner job. If the box status allows it, the box will re-evaluate and succeed once that inner job completes.
Interview Questions on This Topic

Q01 (Senior): Walk me through how you would handle a failed AutoSys job in production.
Q02 (Junior): What do you check before restarting a failed AutoSys job?
Q03 (Senior): If a BOX is in FAILURE because one inner job failed, how do you recover?
Q04 (Senior): What is the difference between restarting a job vs restarting its parent...
Q05 (Senior): After restarting a failed job, how do you verify the recovery was succes...

Q01 (Senior): Walk me through how you would handle a failed AutoSys job in production.

ANSWER
First, triage scope: count the failed jobs (autorep -J % | grep ' FA '). If many, investigate the environment: machine, database, network. If single, get the std_err_file path from autorep -J job -q, then tail the log. Fix the root cause (e.g., restart the database, correct the script input). Then issue RESTART for a clean retry, or FORCE_STARTJOB if conditions are blocking. Monitor the restart with watch -n 10 'autorep -J job'. After success, verify the downstream chain and the actual output. Finally, document the incident.
Frequently Asked Questions

01. What should I do when an AutoSys job fails?
02. How do I restart an AutoSys BOX that failed?
03. My AutoSys job keeps failing at the same step — what should I do?
04. After restarting, how long should I monitor before declaring recovery complete?
05. How do I stop downstream jobs from running after a manual fix?