Mid-level 4 min · March 19, 2026
AutoSys Job Failure Handling and Restart

AutoSys Restart Failures — Diagnosis Before Recovery

Blind restart after ETL failure caused downstream job to process stale data.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Production
production tested
May 23, 2026
last updated
1,554
articles · all by Naren
Quick Answer
  • Acute triage: check scope (autorep -J % -s FA) before touching any job
  • Always read std_err/std_out logs before restart: most failures are environment, not script
  • STARTJOB honors the job's starting conditions; FORCE_STARTJOB bypasses them, so use it with intent
  • Box failures require resetting inner jobs to INACTIVE before forced restart
  • Verify downstream chain completes — SUCCESS != correct data
  • Recurring failures at same step = root cause analysis, not another restart
✦ Definition · ~90s read
What is AutoSys Job Failure Handling and Restart?

AutoSys job failure handling restart is the systematic process of recovering a failed job, not a blind retry button. When a job exits with a non-zero status or is killed by the system, simply re-running it without diagnosis often leads to cascading failures, wasted cycles, and masked root causes.

The restart mechanism in AutoSys, whether via sendevent -E FORCE_STARTJOB, sendevent -E KILLJOB followed by a manual rerun, or the n_retrys job attribute, is a tool for controlled recovery, but only after you've triaged the failure. In production environments handling thousands of jobs daily, treating restart as a first response rather than a last resort is a common anti-pattern that erodes reliability.

This process fits into the broader job lifecycle management ecosystem alongside tools like Control-M, Tivoli Workload Scheduler, or AWS Step Functions. Unlike those platforms, AutoSys lacks built-in intelligent retry with exponential backoff or automatic dependency resolution — you own the recovery logic.

When NOT to use a restart: if the failure is due to a systemic issue like a down database, full filesystem, or expired credentials, restarting is pointless until the underlying condition is fixed. The correct approach mirrors incident response in SRE: triage first (check exit code, box status, and job dependencies), then diagnose (parse the job's STDOUT/STDERR, the AutoSys event log, and the agent's autosys.log), then fix (correct the script, environment, or resource), and only then restart.

Concretely, a restart without diagnosis is gambling with your SLA. For example, a job that fails with exit code 1 due to a missing input file will fail again on restart unless you either restore the file or modify the job definition. The n_retrys attribute, which reruns a failed job up to a configured count, is useful only for transient failures like network timeouts, not for logic errors.

After restart, verification must go beyond the SUCCESS status in autorep: check that downstream jobs triggered correctly, that output files have the expected timestamps and sizes, and that no residual locks or zombie processes remain. Post-recovery, document the failure mode, the fix applied, and whether the job definition needs a permanent change — such as adding a pre-check or updating the max_run_alarm threshold — to prevent recurrence.

Plain-English First

When an AutoSys job fails at 3 AM, you need a clear playbook: find out why it failed, fix the issue, restart correctly, and verify recovery. This article is that playbook.

Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.

Why AutoSys Job Failure Handling Restart Is Not a Retry Button

AutoSys job failure handling restart is a deterministic recovery mechanism that re-runs a failed job from its defined start point, not from where it crashed. The core mechanic: when a job's command exits, AutoSys compares the exit code with the job's max_exit_success attribute (0 by default) to decide between SUCCESS and FAILURE, and on FAILURE it consults the n_retrys attribute to decide whether to restart automatically. This is not a simple retry: the job still lives inside its dependencies, box hierarchy, and global conditions.

In practice, automatic restarts are governed by n_retrys. With n_retrys: 3, AutoSys reruns a job that ends in FAILURE up to 3 times. If you need a fixed wait or a backoff between attempts, the safest assumption is that you provide it yourself, for example in a wrapper script; check your release's documentation before relying on a scheduler-side interval. Critically, the restart does not reset the job's exit code history: the job's status moves from FAILURE to RESTART, and the exit code from the failed run persists in the job report. Downstream jobs that depend on the failed job's exit code may therefore still see the failure unless you handle it explicitly.

Use this mechanism when a job fails due to transient conditions — network timeouts, resource contention, or temporary file locks — but not for logic errors or data corruption. In real systems, misconfigured restart policies cause cascading failures: a job that fails due to a missing file will keep restarting, consuming resources and delaying manual intervention. The restart is a tactical recovery tool, not a substitute for root cause analysis.

Restart ≠ Clean State
A restarted job does not automatically clean its previous output files or reset its exit code — you must handle idempotency and state cleanup in the job script itself.
Production Insight
A batch processing job failed due to a full disk. Restarting it 5 times (n_retrys: 5) only filled the disk further, taking down the entire job stream.
Symptom: jobs stuck in RESTART state, disk usage climbing, no alert on the root cause.
Rule of thumb: set n_retrys to 1 or 2 for I/O-heavy jobs, and always monitor disk space before enabling auto-restart.
Key Takeaway
AutoSys restart is a stateful retry, not a clean slate, so design jobs to be idempotent.
Set n_retrys low (1-2) and space out attempts to avoid hammering shared resources.
Never rely on restart to fix logic errors — it only masks the real problem.
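
One way to get the idempotency this section asks for is to write output to a temporary file and rename it into place, so a rerun replaces the previous result instead of appending to it or leaving a half-written file. The following is a minimal sketch under that assumption, not the article's own script; write_atomically and the paths used with it are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch: make a job step safe to rerun by writing its output atomically.
# write_atomically is a hypothetical helper, not an AutoSys feature.

# Usage: some_command | write_atomically /path/to/final_output
write_atomically() {
  local target="$1"
  local tmp
  # Create the temp file next to the target so mv stays on one filesystem.
  tmp="$(mktemp "${target}.tmp.XXXXXX")" || return 1
  # Stream stdin into the temp file; a crash here leaves the target untouched.
  if cat > "$tmp"; then
    mv -f "$tmp" "$target"   # replaces any output left by a previous run
  else
    rm -f "$tmp"
    return 1
  fi
}
```

A restarted job that pipes its result through a helper like this ends up with exactly one complete output file, whichever attempt finished last.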
Figure: Failure Recovery Flow, step by step from alert to verified recovery. Alert fires (alarm_if_fail) → Check scope (autorep -J % -s FA) → Read error log (cat std_err_file) → Fix root cause (code/infra fix) → Restart job (sendevent) → Verify output (check data). Source: thecodeforge.io

Step 1: Triage — identify the failure

When paged, your first job is to understand the scope before touching anything. Don't jump into the first failed job — check how many others are down. A single failure is a script or dependency issue. A cascade points to an environment problem — machine, database, network, or upstream job chain.

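triage.sh (Bash)
# 1a. How many jobs are failing?
# If your autorep release does not accept a status after -s,
# filter the summary instead: autorep -J % | grep -c ' FA '
autorep -J % -s FA | wc -l

# 1b. Are jobs failing on a specific machine?
# NR > 3 skips the report header; adjust if your header differs.
# -q prints the JIL definition, which includes the machine attribute.
autorep -J % -s FA | awk 'NR > 3 {print $1}' | while read -r j; do
  autorep -J "$j" -q | grep 'machine:'
done

# 1c. Is the agent machine available?
autorep -M ALL  # check all machines: any Missing or Offline?

# 1d. Are we looking at PEND_MACH instead of FAILURE?
autorep -J % -s PE | wc -l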
Production Insight
A cascade of failures across unrelated jobs means the problem is environmental.
Single job failure is usually script or dependency — check its error log first.
Never restart a job without knowing the scope — you may restart it into a dying machine.
Key Takeaway
Triage scope before touching any job.
One failure = script/input problem; many failures = environment.
Check machine and agent status before restarting.
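
The "one failure versus many" rule above can be sketched as a small filter over an autorep report. This is an illustration only: it assumes the status abbreviation (FA) appears as its own whitespace-separated column, and the threshold of 3 is an arbitrary example value, not a number from AutoSys. count_failed and classify_scope are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: count FAILURE rows in autorep-style text and classify the scope.
# Assumes the status column holds the two-letter abbreviation FA.

count_failed() {
  # Reads report text on stdin; prints the number of rows whose status is FA.
  # Field 1 is the job name, so the scan starts at field 2.
  awk '{ for (i = 2; i <= NF; i++) if ($i == "FA") { n++; break } }
       END { print n + 0 }'
}

classify_scope() {
  # $1 = number of failed jobs, $2 = cascade threshold (example default: 3)
  local failed="$1" threshold="${2:-3}"
  if [ "$failed" -eq 0 ]; then
    echo "none"
  elif [ "$failed" -lt "$threshold" ]; then
    echo "job-level"      # look at the job's script, input, dependencies
  else
    echo "environment"    # look at machine, database, network first
  fi
}
```

At a terminal you might pipe a report into it, for example `classify_scope "$(autorep -J % | count_failed)"`, and only then decide which job to open first.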

Step 2: Diagnose — read the error log

Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why. AutoSys captures standard error and standard output. Always check both — sometimes the script succeeds but reports errors to stdout that don't affect exit code.

diagnose.sh (Bash)
# 2a. Get the log paths from the job definition (-q prints the JIL)
autorep -J failed_job -q | grep -E 'std_err_file|std_out_file'

# 2b. Read the error log
tail -100 /logs/autosys/failed_job.err

# 2c. Review the job's run history
autorep -J failed_job -r -1    # previous run
autorep -J failed_job -d       # detailed current state, including events

# 2d. Check the current status (the exit code is in the autorep report above)
autostatus -J failed_job
Output
std_err_file: /logs/autosys/failed_job.err
Contents of failed_job.err:
[2026-03-19 22:02:15] ERROR: Cannot connect to database DB_PROD
[2026-03-19 22:02:15] Connection refused: prod-db-01:5432 (timeout after 30s)
[2026-03-19 22:02:15] Script exiting with code 1
Always read the error log BEFORE restarting
If the database is down, restarting the job gets you nothing. Fix the underlying issue first. Reading the error log takes 30 seconds; diagnosing why a restart failed again takes much longer.
Production Insight
In production, error logs are your first source of truth — don't rely on exit codes alone.
Scripts often exit 0 even when they produce incorrect output.
Check both std_err and std_out — the real problem may be hidden in the output log.
Key Takeaway
Always read the error log before restart.
Check both std_err and std_out.
The error log is free — ignoring it costs time and pager stress.
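
To make the "read before restart" habit quicker at 3 AM, the last lines of the error log can be sorted into a rough failure category. The sketch below is one possible approach; the patterns are examples drawn from common Unix and database error messages and will need tuning for your scripts. classify_error_log is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: map the tail of a job's error log to a rough failure category.
# The patterns are illustrative; extend them for your own scripts.

classify_error_log() {
  local log="$1"
  local recent
  if [ ! -r "$log" ]; then
    echo "no-log"          # wrong std_err_file path or output redirected
    return 0
  fi
  recent="$(tail -n 100 "$log")"
  if printf '%s\n' "$recent" | grep -qiE 'connection refused|cannot connect|timed out|timeout'; then
    echo "connectivity"    # restarting is pointless until the target is up
  elif printf '%s\n' "$recent" | grep -qiE 'no space left on device|disk quota exceeded'; then
    echo "disk"
  elif printf '%s\n' "$recent" | grep -qiE 'permission denied|authentication failed'; then
    echo "credentials"
  elif printf '%s\n' "$recent" | grep -qiE 'no such file or directory'; then
    echo "missing-input"
  else
    echo "unknown"         # read the log by hand
  fi
}
```

Run against the sample log shown above, a helper like this would report a connectivity problem, which tells you to check the database before touching the job.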

Step 3: Fix and restart correctly

After fixing the root cause, restart the job using the right event. STARTJOB is the safe option: it starts the job only if its starting conditions (dependencies, machine requirements) are satisfied. RESTART is the status autorep shows while AutoSys reschedules a failed job, not an event you send. FORCE_STARTJOB bypasses all conditions, so use it when you need to start a job that is still blocked by unmet dependencies (e.g., after manually fixing data). For box failures, never restart the box without first resetting the failed inner jobs; otherwise the box will immediately re-fail.

restart.sh (Bash)
# Option A: STARTJOB (for failed jobs; honors starting conditions)
sendevent -E STARTJOB -J failed_job

# Option B: FORCE_STARTJOB (bypasses conditions; use when needed)
sendevent -E FORCE_STARTJOB -J failed_job

# Option C: If a BOX failed, restart the box and all its inner jobs
# First, reset all inner FAILURE jobs:
sendevent -E CHANGE_STATUS -J inner_job1 -s INACTIVE
sendevent -E CHANGE_STATUS -J inner_job2 -s INACTIVE
# Then restart the box:
sendevent -E FORCE_STARTJOB -J eod_box

# Monitor the restart
watch -n 10 'autorep -J failed_job'
Output
03:41:00 failed_job: STARTING
03:41:02 failed_job: RUNNING
03:49:33 failed_job: SUCCESS
03:49:34 downstream_job: STARTING
Production Insight
STARTJOB respects job dependencies; FORCE_STARTJOB does not.
Resetting box inner jobs to INACTIVE prevents the box from re-failing immediately.
Always monitor the restart: a job that goes directly from RUNNING to FAILURE in seconds indicates an unfixed issue.
Key Takeaway
STARTJOB = safe retry; FORCE_STARTJOB = bypass conditions.
Boxes need inner INACTIVE reset first.
Monitor the restart — quick fail means root cause still present.
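
For box recovery it can help to print the exact events before sending any of them, so the sequence can be reviewed under pressure. The sketch below only prints commands (a dry run) and assumes the box and inner job names are passed in by the operator; box_recovery_plan is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print the box recovery sequence described above as a dry run.
# Nothing is sent to AutoSys; the output is meant to be reviewed first.

# Usage: box_recovery_plan BOX_NAME FAILED_INNER_JOB...
box_recovery_plan() {
  local box="$1"
  shift
  local job
  # Step 1: reset every failed inner job so the box does not re-fail at once.
  for job in "$@"; do
    printf 'sendevent -E CHANGE_STATUS -s INACTIVE -J %s\n' "$job"
  done
  # Step 2: start the box, bypassing its starting conditions.
  printf 'sendevent -E FORCE_STARTJOB -J %s\n' "$box"
}
```

For example, `box_recovery_plan eod_box inner_job1 inner_job2` prints three lines; once they look right, they can be pasted or piped to a shell.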

Step 4: Verify recovery — beyond the SUCCESS status

A job showing SUCCESS in AutoSys means its exit code was 0. It does not mean the data is correct or the downstream chain completed. Verification is the step that separates reliable operators from ones who get called back an hour later. Check that all downstream jobs in the chain succeeded, and then validate the actual output: file size, row count, or a simple data check.

verify_recovery.sh (Bash)
# Check all downstream jobs in the chain
autorep -J eod_%

# Confirm the box moved to SUCCESS
autorep -J eod_processing_box

# Check the actual output — did the script produce the expected file?
ls -la /data/output/daily_report_$(date +%Y%m%d).csv
wc -l /data/output/daily_report_$(date +%Y%m%d).csv  # check row count
Output
Job Name ST Exit
eod_extract SU 0
eod_transform SU 0
eod_load SU 0
eod_report SU 0
eod_processing_box SU --
/data/output/daily_report_20260319.csv: 54823 lines
Production Insight
SUCCESS in AutoSys means exit code 0, not correct data.
A job that exits 0 with an empty file is still SUCCESS — and will corrupt downstream processes.
Always validate output for critical jobs — row count, checksum, or a sample read.
Key Takeaway
SUCCESS != correct data.
Always verify downstream chain and actual output.
An empty or corrupt file with exit code 0 is a silent production killer.
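
The output check described in this step can be wrapped in a function that fails loudly on an empty or short file. This is a minimal sketch: the minimum row count is an assumption you would set per job, and validate_output is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: fail when an output file is missing, empty, or too short.
# The row threshold is job-specific; 1 is only a placeholder default.

# Usage: validate_output FILE [MIN_ROWS]
validate_output() {
  local file="$1" min_rows="${2:-1}"
  local rows
  # -s is true only for a file that exists and has a size greater than zero.
  if [ ! -s "$file" ]; then
    echo "FAIL: $file is missing or empty" >&2
    return 1
  fi
  rows="$(wc -l < "$file" | tr -d '[:space:]')"
  if [ "$rows" -lt "$min_rows" ]; then
    echo "FAIL: $file has $rows rows, expected at least $min_rows" >&2
    return 1
  fi
  echo "OK: $file has $rows rows"
}
```

Called as `validate_output /data/output/daily_report_$(date +%Y%m%d).csv 50000`, it turns "SUCCESS with an empty file" into a non-zero exit that an operator or a follow-up job can act on.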

Step 5: Post-recovery actions — document and escalate

Recovery isn't complete until you've recorded what happened and why. This step prevents recurring failures and helps your team learn. Write a brief incident summary: what failed, root cause, fix applied, time to recover, and any follow-up needed (e.g., increase term_run_time, add a healthcheck to the database monitoring). If the failure indicates a systemic problem (e.g., multiple jobs using the same broken dependency), escalate to the appropriate team. Finally, update any runbooks or monitoring rules that could have caught the issue earlier.

post_recovery.sh (Bash)
# Log the incident to a shared file or alerting system
echo "$(date): Job eod_extract failed: DB_PROD unreachable" >> /var/log/autosys_incidents.log
echo "Root cause: Network switch failure. Fixed by ops team." >> /var/log/autosys_incidents.log
echo "Recovery time: 12 min" >> /var/log/autosys_incidents.log

# If needed, update the job's term_run_time (in minutes) or retry count via JIL.
# sendevent changes job state, not job attributes. 20 is an example value.
jil <<'EOF'
update_job: eod_extract
term_run_time: 20
EOF

# Notify stakeholder
mail -s "AutoSys incident report: eod_extract" ops-team@company.com < /var/log/autosys_incidents.log
Why post-recovery matters
Every incident is a chance to improve your system. By documenting root cause and resolution, you build a knowledge base that reduces MTTR for future failures. Escalate systemic issues to prevent recurrence across the entire batch schedule.
Production Insight
Undeclared failures are repeat failures.
If you don't record the root cause, the next person will restart into the same broken environment.
Post-recovery documentation is the foundation of operational maturity.
Key Takeaway
Always document the incident.
Escalate systemic issues.
Post-recovery is when you prevent the next failure.
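
The incident summary fields listed above (what failed, root cause, fix, time to recover) can be captured in one consistent record rather than free-form echo lines. The sketch below assumes a plain append-only text file; the field names and log_incident are hypothetical choices, not an AutoSys convention.

```shell
#!/usr/bin/env bash
# Sketch: append one structured incident record to a shared log file.
# Field names are illustrative; pick whatever your team can grep for.

# Usage: log_incident LOGFILE JOB ROOT_CAUSE FIX RECOVERY_MINUTES
log_incident() {
  local logfile="$1" job="$2" cause="$3" fix="$4" minutes="$5"
  {
    printf '%s job=%s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$job"
    printf '  root_cause: %s\n' "$cause"
    printf '  fix: %s\n' "$fix"
    printf '  recovery_minutes: %s\n' "$minutes"
  } >> "$logfile"
}
```

A call such as `log_incident /var/log/autosys_incidents.log eod_extract "Network switch failure" "Fixed by ops team" 12` keeps every entry in the same shape, which makes recurring root causes easier to spot later.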

The 3AM Blind Spot: Why a SUCCESS Exit Code Can Lie to You

You fixed the job. Restarted it. Exit code 0. Good night, right? Wrong. A SUCCESS status only tells you the shell command didn't crash. It does not tell you the data landed in the right table, or that the file transfer completed without corruption, or that your downstream job didn't consume garbage.

In production, I've seen AutoSys jobs exit cleanly while writing to a full disk, hitting a stale symlink, or processing yesterday's snapshot instead of today's. The OS says it's fine. AutoSys says it's fine. But your business logic just silently robbed a bank.

The fix: always run a post-success validation step. Chain a lightweight verification job that checks checksums, row counts, or API responses, and make downstream jobs depend on it rather than on the main job. Never trust the exit code alone. Trust your validation layer.

ValidateAfterSuccess.jil (JIL)
/* io.thecodeforge: devops tutorial */
/* This job validates data integrity after the main process */

insert_job: data_pipeline_validate   job_type: CMD
machine: prod_app_server_01
command: /opt/scripts/validate_incremental_load.sh

/* Run only if the main job ends in SUCCESS */
condition: s(data_pipeline_main)

/* Any non-zero exit from the validator fails this job loudly */
max_exit_success: 0
alarm_if_fail: 1

/* Capture both stdout and stderr for the forensic log */
std_out_file: /var/log/autosys/validation_<JIID>.out
std_err_file: /var/log/autosys/validation_<JIID>.err

/* Route the failure alarm for this job, not the main job, */
/* to ops@company.com through your site's notification setup */
Output
Successfully inserted validation job.
JOB: data_pipeline_validate ... ALIAS NOT FOUND (expected)
Status after validation: SUCCESS (checksum match)
No downstream triggers fired (validation passed).
Production Trap:
Never add validation inside the same command as the main job. If validation crashes, the main job's exit code is lost. Separate them into sibling jobs with a condition of 's(main)' so each failure path is independently debuggable.
Key Takeaway
SUCCESS only means the shell didn't crash. Always pair your main job with a validation sibling that checks business logic, not just the OS.
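
What the validation sibling actually runs is up to you; one simple option is a checksum comparison against a value published by the producer. The sketch below uses the POSIX cksum utility for portability (a stronger hash such as sha256sum could be swapped in), and the return codes 10 and 11 are arbitrary examples. verify_checksum is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: compare a file's CRC checksum with an expected value.
# Return codes 10 and 11 are arbitrary; choose codes your job maps to FAILURE.

# Usage: verify_checksum FILE EXPECTED_CRC
verify_checksum() {
  local file="$1" expected="$2"
  local actual
  if [ ! -r "$file" ]; then
    echo "missing: $file" >&2
    return 11
  fi
  # cksum prints "<crc> <bytes>"; keep only the CRC.
  actual="$(cksum < "$file" | awk '{print $1}')"
  if [ "$actual" != "$expected" ]; then
    echo "checksum mismatch for $file: got $actual, want $expected" >&2
    return 10
  fi
  echo "checksum ok: $file"
}
```

Because the check lives in its own script and its own job, a mismatch shows up as a failure of the validation job while the main job's exit code stays intact.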

The AutoSys Auto-Restart Trap: When "Retry" Is Sabotage

Every junior's first instinct: enable auto-retry in the job definition so the system handles it. Stop. Auto-retry is not a recovery strategy. It is a bandage that masks transient failures and turns permanent ones into infinite loops.

Here's the rule: auto-retry only for infrastructure hiccups — network timeouts, disk pressure, DNS failures. Never for logic errors, data corruption, or missing dependencies. If your job fails because the input file is malformed, retrying it 47 times will never fix the file. It just screams louder.

Configure auto-retry with a low ceiling (n_retrys: 3) and a backoff that doubles the interval between attempts. Monitor the retry count as an alert: if a job exhausts its third retry, don't restart, escalate. Write a wrapper that exits with a dedicated code (e.g., 127) for logic failures and treat that code as "do not retry, escalate". Note that n_retrys by itself reruns on any FAILURE, so the wrapper, not the scheduler, has to enforce that distinction.

SmartRetryConfig.jil (JIL)
/* io.thecodeforge: devops tutorial */
/* Retry with failure code discrimination */

insert_job: ingestion_api_call   job_type: CMD
machine: prod_batch_01
command: /opt/scripts/ingest_partner_data.sh

/* Retry count. n_retrys reruns on any FAILURE and does not tell exit */
/* codes apart, so infra-only retries and the 30s, 60s, 120s backoff */
/* have to be enforced by a wrapper around the command. */
n_retrys: 3

/* Logic failures (code 127) must NOT be retried */
/* They go straight to manual escalation */

std_out_file: /var/log/autosys/ingest_<JIID>.out
std_err_file: /var/log/autosys/ingest_<JIID>.err

/* Alarm once the job finally ends in FAILURE, then notify oncall@team.com */
alarm_if_fail: 1
Output
Job ingestion_api_call failed with exit code 6 (network timeout).
Retrying in 30 seconds...
Retry 1: SUCCESS
No further action needed.
Senior Shortcut:
Write a single shared wrapper script that maps failure codes to categories: 0-49 = retriable, 50-99 = permanent, 100+ = dependency missing. Then all your jobs inherit the same retry logic. One file to audit, not 200 job definitions.
Key Takeaway
Auto-retry with exit code discrimination is a scalpel. Auto-retry on everything is a hammer that breaks your own windows.
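
The shared wrapper from the Senior Shortcut could look roughly like the sketch below. It follows the code ranges given there, with one assumption added: exit code 0 is treated as success, so the retriable range is 1-49. The function names, and the choice to double the delay after each attempt, are illustrative rather than anything AutoSys provides.

```shell
#!/usr/bin/env bash
# Sketch: retry a command only for exit codes classified as retriable.
# Ranges follow the article: 1-49 retriable, 50-99 permanent, 100+ dependency.

classify_exit() {
  local code="$1"
  if [ "$code" -eq 0 ]; then echo "success"
  elif [ "$code" -le 49 ]; then echo "retriable"
  elif [ "$code" -le 99 ]; then echo "permanent"
  else echo "dependency"
  fi
}

# Usage: run_with_retry MAX_RETRIES FIRST_DELAY_SECONDS COMMAND [ARGS...]
run_with_retry() {
  local max_retries="$1" delay="$2"
  shift 2
  local attempt=0 code
  while :; do
    "$@"
    code=$?
    # Success, permanent, and dependency codes are returned immediately.
    [ "$(classify_exit "$code")" = "retriable" ] || return "$code"
    # Retries exhausted: hand the last code back for escalation.
    [ "$attempt" -lt "$max_retries" ] || return "$code"
    attempt=$((attempt + 1))
    echo "exit $code is retriable; retry $attempt/$max_retries in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))   # 30, 60, 120 when FIRST_DELAY_SECONDS is 30
  done
}
```

A job's command would then become something like `run_with_retry 3 30 /opt/scripts/ingest_partner_data.sh`, with scheduler-side retries switched off so the two mechanisms do not multiply each other.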
● Production incident · POST-MORTEM · severity: high

The Silent Restart That Cost a Trading Window

Symptom
After an ETL job failed, the operator restarted it immediately. The job failed again with the same error. A downstream job that had already started (because the box status was SUCCESS from a previous run) then crashed because it received incomplete data.
Assumption
The failure was a transient script error — restart would fix it.
Root cause
The source database was unreachable due to a network switch failure. The second restart caused a downstream job to process stale data, corrupting the daily report.
Fix
Read the error log first: autorep -J job -q | grep std_err_file gives the path. The log showed connection refused. Fixed the network path, then restarted with STARTJOB once the database was reachable. Put the affected downstream jobs ON_ICE with sendevent -E JOB_ON_ICE before restarting.
Key lesson
  • Always read the error log before any restart — the root cause is rarely the script.
  • Know the downstream dependency chain and protect it from partial data.
  • A forced restart without diagnosis is gambling with production.
Production debug guide: Symptom → Action for common failure scenarios (4 entries)
Symptom · 01
Job shows FAILURE but no error log found
Fix
Check std_err_file path in job definition. Also check if the script redirected output elsewhere. Use autorep -J job -q to see full attributes.
Symptom · 02
Box in FAILURE, but inner jobs all show SUCCESS
Fix
The box failed due to a condition mismatch or a dependency failure outside the box. Check autorep -J box -d for exit code and look for NRI or TERM status on box.
Symptom · 03
Job stuck in STARTING or RUNNING for too long
Fix
Check agent machine availability (autorep -M ALL). Kill the stuck process with sendevent -E KILLJOB -J job, then start it again once the cause is fixed. Also check term_run_time.
Symptom · 04
PEND_MACH with no agent issues
Fix
Check global variables (autorep -G ALL), job dependency conditions (autorep -J job -q | grep condition), and run_window. Use sendevent -E FORCE_STARTJOB only as a last resort.
★ Quick Debug Cheat Sheet: for the 3 AM pager, commands to understand scope and act.
Paged for job failure
Immediate action
Count failed jobs
Commands
autorep -J % -s FA | wc -l
autorep -J % -s FA | head -20
Fix now
If single job: read its error log. If many: check machine availability.
Job failed, need error log
Immediate action
Get error log path
Commands
autorep -J job -q | grep std_err_file
tail -100 /path/to/err_file
Fix now
Identify root cause: connection issue, script bug, resource shortage.
Box is FAILURE, need to recover
Immediate action
List inner job statuses
Commands
autorep -J box_name%
autorep -J inner_failed -d
Fix now
Reset failed inner jobs to INACTIVE, then FORCE_STARTJOB the box.
Verifying recovery after restart
Immediate action
Check restarted job status
Commands
autorep -J job
autorep -J job_name%
Fix now
If SUCCESS, verify actual output file/row count. If still FAILURE, re-diagnose.
Failure recovery decision table
Scenario | Recovery action | Key consideration
Single job FAILURE | Fix cause → STARTJOB or FORCE_STARTJOB | Read error log first
BOX with one inner job failed | Fix cause → STARTJOB the inner job | Check if other inner jobs are now blocked
BOX in FAILURE with all inner jobs failed | Fix cause → FORCE_STARTJOB the box | May need to reset inner jobs to INACTIVE first
PEND_MACH | Fix the agent machine → jobs auto-recover | Don't FORCE_STARTJOB while agent is down
TERMINATED by term_run_time | Increase term_run_time → STARTJOB | Investigate why job ran longer than expected
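
The decision table can also be kept next to your runbook as a lookup keyed on the two-letter status that autorep prints. This is a sketch of that idea only; the wording of each recommendation is paraphrased from the table, STARTJOB here means the sendevent event that starts a job subject to its conditions, and recommend_action is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: map an autorep status abbreviation to the table's recovery advice.
# FA = FAILURE, TE = TERMINATED, PE = PEND_MACH, SU = SUCCESS.

recommend_action() {
  case "$1" in
    FA) echo "read the error log, fix the cause, then send STARTJOB" ;;
    TE) echo "find out why the job overran, adjust term_run_time if justified, then send STARTJOB" ;;
    PE) echo "fix the agent machine; do not FORCE_STARTJOB while it is down" ;;
    SU) echo "verify downstream jobs and the actual output" ;;
    *)  echo "no rule for status '$1'; inspect the job with autorep -d" ;;
  esac
}
```

The value of a lookup like this is less the code than the habit: the status decides the first action, and none of the actions is "restart and hope".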

Key takeaways

1. Always read the error log before restarting: a restart into a still-broken environment wastes time.
2. For single job failures, fix the cause and then send STARTJOB; for box failures, you may need to reset inner jobs to INACTIVE first.
3. After recovery, verify actual output: SUCCESS status means exit code 0, not that the data is correct.
4. Document what went wrong and the fix in the incident log: recurring failures need root cause analysis.

Common mistakes to avoid

4 patterns
×

Restarting without reading the error log

Symptom
Job fails again with same error; additional delays and pager fatigue.
Fix
Always run autorep -J job -q to find the error log path, then tail the log before any restart.
×

Restarting a job inside a box without considering the box's state

Symptom
Inner job restarts but box remains FAILURE; downstream jobs not triggered.
Fix
Check box status with autorep -J box -d. If box is FAILURE, reset inner jobs to INACTIVE and restart the box, not just the inner job.
×

Not verifying actual output after recovery

Symptom
AutoSys reports SUCCESS but data is missing or corrupt; downstream failures happen later.
Fix
After restart, check actual output file size/row count and run a sample validation before declaring success.
×

Restarting the entire box when only one inner job failed

Symptom
Other inner jobs that completed successfully restart unnecessarily, extending recovery time and risking data duplication.
Fix
Restart only the failed inner job. If the box status allows it, the box will re-evaluate and succeed once that inner job completes.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01SENIOR
Walk me through how you would handle a failed AutoSys job in production.
Q02JUNIOR
What do you check before restarting a failed AutoSys job?
Q03SENIOR
If a BOX is in FAILURE because one inner job failed, how do you recover?
Q04SENIOR
What is the difference between restarting a job vs restarting its parent...
Q05SENIOR
After restarting a failed job, how do you verify the recovery was succes...
Q01 of 05SENIOR

Walk me through how you would handle a failed AutoSys job in production.

ANSWER
First, triage scope: check how many jobs failed (autorep -J % -s FA). If many, investigate the environment: machine, database, network. If single, get the std_err path from autorep -q, then tail the log. Fix the root cause (e.g., restart the database, correct the script input). Then send STARTJOB for a clean retry, or FORCE_STARTJOB if conditions are blocking. Monitor the restart with watch -n 10 and autorep -J. After success, verify the downstream chain and the actual output. Finally, document the incident.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01
What should I do when an AutoSys job fails?
02
How do I restart an AutoSys BOX that failed?
03
My AutoSys job keeps failing at the same step — what should I do?
04
After restarting, how long should I monitor before declaring recovery complete?
05
How do I stop downstream jobs from running after a manual fix?
COMPLETE GUIDE
The Complete AutoSys Workload Automation Guide for Engineers →

JIL syntax, sendevent, autorep, box jobs, file watchers, scheduling, HA, security, cloud workload automation, and 22 interview questions — the definitive AutoSys reference for production engineers.
