AutoSys Job Failure Handling and Restart Procedures
- Triage first: check the failure scope (autorep -J % -s FA) before touching any job
- Always read std_err/std_out logs before restart — 90% of failures are environment, not script
- RESTART is clean; FORCE_STARTJOB bypasses conditions — use with intent
- Box failures require resetting inner jobs to INACTIVE before forced restart
- Verify downstream chain completes — SUCCESS != correct data
- Recurring failures at same step = root cause analysis, not another restart
Quick Debug Cheat Sheet

| Situation | Commands |
|---|---|
| Paged for a job failure | `autorep -J % -s FA \| wc -l`, then `autorep -J % -s FA \| head -20` |
| Job failed, need the error log | `autorep -J job -d \| grep std_err`, then `tail -100 /path/to/err_file` |
| Box is FAILURE, need to recover | `autorep -J box_name%`, then `autorep -J inner_failed -d` |
| Verifying recovery after restart | `autorep -J job`, then `autorep -J job_name%` |
Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.
Step 1: Triage — identify the failure
When paged, your first job is to understand the scope before touching anything. Don't jump into the first failed job — check how many others are down. A single failure is a script or dependency issue. A cascade points to an environment problem — machine, database, network, or upstream job chain.
```bash
# 1a. How many jobs are failing?
autorep -J % -s FA | wc -l

# 1b. Are jobs failing on a specific machine?
autorep -J % -s FA | awk '{print $1}' | while read j; do
  autorep -J "$j" -d | grep machine
done

# 1c. Is the agent machine available?
autorep -M %   # check all machines — any MISSING?

# 1d. Are we looking at PEND_MACH instead of FAILURE?
autorep -J % -s PE | wc -l
```
Step 2: Diagnose — read the error log
Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why. AutoSys captures standard error and standard output. Always check both — sometimes the script succeeds but reports errors to stdout that don't affect exit code.
```bash
# 2a. Get the error log path
autorep -J failed_job -d | grep -E 'std_err|std_out'

# 2b. Read the error log
tail -100 /logs/autosys/failed_job.err

# 2c. Check the previous run and the detailed current state
autorep -J failed_job -r -1   # previous run (relative run number)
autorep -J failed_job -d      # detailed current state

# 2d. Check the exit code
autostatus -J failed_job
```
Contents of failed_job.err:

```
[2026-03-19 22:02:15] ERROR: Cannot connect to database DB_PROD
[2026-03-19 22:02:15] Connection refused: prod-db-01:5432 (timeout after 30s)
[2026-03-19 22:02:15] Script exiting with code 1
```
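Before issuing any restart, it's worth confirming the diagnosed cause is actually fixed. A minimal sketch, assuming `nc` (netcat) is available on the job's machine, reusing the host and port from the log above:

```bash
# Probe the database port from the error log before restarting.
# If it still refuses connections, a restart will just re-fail.
if nc -z -w 5 prod-db-01 5432; then
  echo "DB reachable again, safe to restart"
else
  echo "DB still unreachable, fix the environment first"
fi
```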
Step 3: Fix and restart correctly
After fixing the root cause, restart the job using the right command. RESTART is the safe option — it clears the failure status and restarts the job with the same conditions (dependencies, machine requirements). FORCE_STARTJOB bypasses all conditions — use it when you need to start a job that is still blocked by unmet dependencies (e.g., after manually fixing data). For box failures, never restart the box without first resetting the failed inner jobs — otherwise the box will immediately re-fail.
```bash
# Option A: RESTART (for failed jobs — clean retry)
sendevent -E RESTART -J failed_job

# Option B: FORCE_STARTJOB (bypasses conditions — use with intent)
sendevent -E FORCE_STARTJOB -J failed_job

# Option C: If a BOX failed — restart the box and all its inner jobs
# First, reset all inner FAILURE jobs:
sendevent -E CHANGE_STATUS -J inner_job1 -s INACTIVE
sendevent -E CHANGE_STATUS -J inner_job2 -s INACTIVE
# Then restart the box:
sendevent -E FORCE_STARTJOB -J eod_box

# Monitor the restart
watch -n 10 'autorep -J failed_job'
```
```
03:41:02 — failed_job: RUNNING
03:49:33 — failed_job: SUCCESS
03:49:34 — downstream_job: STARTING
```
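`watch` is fine for eyeballing the restart; if you'd rather have the shell tell you when it's done, a hypothetical polling loop over `autostatus` (which prints the job's current status word) does the same thing:

```bash
# Poll every 30 seconds until the restarted job reaches a terminal state
while :; do
  status=$(autostatus -J failed_job)
  echo "$(date '+%H:%M:%S') failed_job: $status"
  case "$status" in
    SUCCESS)            echo "Recovered: move on to verification"; break ;;
    FAILURE|TERMINATED) echo "Failed again: back to Step 2"; break ;;
  esac
  sleep 30
done
```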
Step 4: Verify recovery — beyond the SUCCESS status
A job showing SUCCESS in AutoSys means its exit code was 0. It does not mean the data is correct or the downstream chain completed. Verification is the step that separates reliable operators from ones who get called back an hour later. Check that all downstream jobs in the chain succeeded, and then validate the actual output: file size, row count, or a simple data check.
```bash
# Check all downstream jobs in the chain
autorep -J eod_%

# Confirm the box moved to SUCCESS
autorep -J eod_processing_box

# Check the actual output — did the script produce the expected file?
ls -la /data/output/daily_report_$(date +%Y%m%d).csv
wc -l /data/output/daily_report_$(date +%Y%m%d).csv   # check row count
```
```
eod_extract           SU   0
eod_transform         SU   0
eod_load              SU   0
eod_report            SU   0
eod_processing_box    SU   --

/data/output/daily_report_20260319.csv: 54823 lines
```
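A raw row count only tells you something when compared against a baseline. A hedged sketch, assuming GNU `date` and that yesterday's report file still exists:

```bash
# Crude sanity check: today's row count should be within 20% of yesterday's
today_file=/data/output/daily_report_$(date +%Y%m%d).csv
yday_file=/data/output/daily_report_$(date -d yesterday +%Y%m%d).csv

today=$(wc -l < "$today_file")
yday=$(wc -l < "$yday_file")

low=$(( yday * 80 / 100 ))
high=$(( yday * 120 / 100 ))

if [ "$today" -lt "$low" ] || [ "$today" -gt "$high" ]; then
  echo "WARNING: $today rows today vs $yday yesterday, investigate before sign-off"
else
  echo "Row count $today is within 20% of yesterday ($yday), looks sane"
fi
```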
Step 5: Post-recovery actions — document and escalate
Recovery isn't complete until you've recorded what happened and why. This step prevents recurring failures and helps your team learn. Write a brief incident summary: what failed, root cause, fix applied, time to recover, and any follow-up needed (e.g., increase term_run_time, add a healthcheck to the database monitoring). If the failure indicates a systemic problem (e.g., multiple jobs using the same broken dependency), escalate to the appropriate team. Finally, update any runbooks or monitoring rules that could have caught the issue earlier.
```bash
# Log the incident to a shared file or alerting system
echo "$(date): Job eod_extract failed — DB_PROD unreachable" >> /var/log/autosys_incidents.log
echo "Root cause: Network switch failure. Fixed by ops team." >> /var/log/autosys_incidents.log
echo "Recovery time: 12 min" >> /var/log/autosys_incidents.log

# If needed, update the job's term_run_time or add a retry policy
# (job attributes are changed via JIL, not sendevent)
jil <<'EOF'
update_job: eod_extract
term_run_time: 1200
EOF

# Notify stakeholders
mail -s "AutoSys incident report: eod_extract" ops-team@company.com < /var/log/autosys_incidents.log
```
| Scenario | Recovery action | Key consideration |
|---|---|---|
| Single job FAILURE | Fix cause → RESTART or FORCE_STARTJOB | Read error log first |
| BOX with one inner job failed | Fix cause → RESTART inner job | Check if other inner jobs are now blocked |
| BOX in FAILURE with all inner jobs failed | Fix cause → FORCE_STARTJOB the box | May need to reset inner jobs to INACTIVE first |
| PEND_MACH | Fix the agent machine → jobs auto-recover | Don't FORCE_STARTJOB while agent is down |
| TERMINATED by term_run_time | Increase term_run_time → RESTART | Investigate why job ran longer than expected |
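The last two rows have no commands elsewhere in this article, so here is a hedged sketch of both recoveries (the machine and job names are hypothetical, and agent restart procedures vary by site):

```bash
# PEND_MACH: check the agent machine; once the agent is back up,
# pending jobs recover on their own; do not FORCE_STARTJOB them
autorep -M prod-app-01

# TERMINATED by term_run_time: raise the limit via JIL, then restart cleanly
jil <<'EOF'
update_job: long_batch_job
term_run_time: 120
EOF
sendevent -E RESTART -J long_batch_job
```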
🎯 Key Takeaways
- Always read the error log before restarting — restart into a still-broken environment wastes time
- For single job failures: fix cause → RESTART; for box failures: may need to reset inner jobs to INACTIVE first
- After recovery, verify actual output — SUCCESS status means exit code 0, not that the data is correct
- Document what went wrong and the fix in the incident log — recurring failures need root cause analysis
⚠ Common Mistakes to Avoid
- Restarting before reading the error log: you restart into the same failure
- Restarting a failed box without first resetting its failed inner jobs to INACTIVE: the box immediately re-fails
- Using FORCE_STARTJOB while the agent machine is still down (PEND_MACH)
- Treating SUCCESS as proof the output is correct: verify files, row counts, and the downstream chain
- Restarting the same job repeatedly instead of doing root cause analysis
Interview Questions on This Topic
- Q: Walk me through how you would handle a failed AutoSys job in production. (Mid-level)
- Q: What do you check before restarting a failed AutoSys job? (Junior)
- Q: If a BOX is in FAILURE because one inner job failed, how do you recover? (Senior)
- Q: What is the difference between restarting a job vs restarting its parent box? (Senior)
- Q: After restarting a failed job, how do you verify the recovery was successful? (Mid-level)
Frequently Asked Questions
What should I do when an AutoSys job fails?
First, read the error log (std_err_file) to understand why it failed. Fix the underlying issue. Then restart with RESTART or FORCE_STARTJOB. Monitor until SUCCESS. Finally, verify the actual output is correct — not just that AutoSys reports success.
How do I restart an AutoSys BOX that failed?
For a BOX failure, you usually need to reset failed inner jobs to INACTIVE first (CHANGE_STATUS -s INACTIVE), then FORCE_STARTJOB the box. Alternatively, restart only the failed inner jobs while leaving succeeded ones in SUCCESS — the box will recheck and proceed.
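A minimal sketch of the second approach, using the eod_processing_box chain from earlier in the article:

```bash
# Restart only the failed inner job; succeeded siblings keep their SU status
sendevent -E RESTART -J eod_transform

# The box re-evaluates as the inner job completes; watch the whole chain
autorep -J eod_processing_box%
```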
My AutoSys job keeps failing at the same step — what should I do?
A recurring failure at the same step usually indicates a root cause in the script, its configuration, or an external dependency. Check the error log in detail, add more logging to the script, and engage the application team. Restarting repeatedly without fixing the root cause is a treadmill.
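To see whether the failure really is consistent, autorep can report previous runs with `-r` and a relative (negative) run number; a small sketch, worth verifying against your AutoSys version:

```bash
# Show the last three runs of the job to spot a repeating pattern
for n in -1 -2 -3; do
  echo "--- run $n ---"
  autorep -J eod_extract -r "$n"
done
```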
After restarting, how long should I monitor before declaring recovery complete?
Monitor until the full downstream job chain completes successfully — not just the restarted job. Use autorep -J box_name% to check all inner jobs. For critical workflows, also verify the actual output data.
How do I stop downstream jobs from running after a manual fix?
If you've fixed data manually outside AutoSys and don't want downstream jobs running again, either use CHANGE_STATUS to mark jobs SUCCESS (to signal done) or put downstream jobs ON_HOLD before the upstream restart, then release them when appropriate.
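A hedged sketch of both options (the job names are hypothetical):

```bash
# Option 1: mark the downstream job SUCCESS so its start condition is
# satisfied without actually running it
sendevent -E CHANGE_STATUS -J downstream_load -s SUCCESS

# Option 2: hold downstream, restart upstream, release when appropriate
sendevent -E JOB_ON_HOLD  -J downstream_load
sendevent -E RESTART      -J upstream_extract
sendevent -E JOB_OFF_HOLD -J downstream_load
```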