AutoSys Restart Failures — Diagnosis Before Recovery
A blind restart after an ETL failure caused a downstream job to process stale data.
- Acute triage: check scope before touching any job (for example, list all jobs with autorep -J ALL and filter on the FA status column)
- Always read std_err/std_out logs before restart — 90% of failures are environment, not script
- RESTART is clean; FORCE_STARTJOB bypasses conditions — use with intent
- Box failures require resetting inner jobs to INACTIVE before forced restart
- Verify downstream chain completes — SUCCESS != correct data
- Recurring failures at same step = root cause analysis, not another restart
When an AutoSys job fails at 3 AM, you need a clear playbook: find out why it failed, fix the issue, restart correctly, and verify recovery. This article is that playbook.
Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.
Step 1: Triage — identify the failure
When paged, your first job is to understand the scope before touching anything. Don't jump into the first failed job; check how many others are down first. A single failure usually points to a script or dependency issue. A cascade points to an environment problem: machine, database, network, or upstream job chain.
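The scope check above can be sketched as a small parser over the text of an autorep summary report. This is a minimal sketch, not a definitive tool: the column layout in SAMPLE_REPORT, the job names, and the function names failed_jobs and triage are assumptions made for illustration, and real autorep output can differ between AutoSys releases.

```python
import re

# Assumed layout of one summary row: job name, last start, last end, two-letter status.
# Real column layout can differ between AutoSys releases.
STAMP = r"(?:\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}|-+)"
ROW = re.compile(rf"^\s*(?P<job>\S+)\s+{STAMP}\s+{STAMP}\s+(?P<status>[A-Z]{{2}})\b")

# Hypothetical report text, e.g. captured from `autorep -J ALL`.
SAMPLE_REPORT = """\
Job Name                 Last Start           Last End             ST Run/Ntry Pri/Xit
________________________ ____________________ ____________________ __ ________ _______
etl_extract              03/15/2024 03:00:01  03/15/2024 03:02:10  FA 1201/1   1
etl_transform            -----                -----                IN 0/0
rpt_daily                03/15/2024 02:00:00  03/15/2024 02:05:00  SU 1199/1   0
"""


def failed_jobs(report: str) -> list[str]:
    """Return names of jobs whose status column is FA (failure) or TE (terminated)."""
    failed = []
    for line in report.splitlines():
        match = ROW.match(line)
        if match and match.group("status") in {"FA", "TE"}:
            failed.append(match.group("job"))
    return failed


def triage(report: str) -> tuple[str, list[str]]:
    """Classify scope: 'none', 'single' (look at the job), or 'multiple' (suspect the environment)."""
    failed = failed_jobs(report)
    if not failed:
        return "none", failed
    return ("single" if len(failed) == 1 else "multiple"), failed


if __name__ == "__main__":
    scope, jobs = triage(SAMPLE_REPORT)
    print(scope, jobs)
```

Keeping the parser separate from the command call means it can be checked against saved report text before anyone relies on it during an incident.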
Step 2: Diagnose — read the error log
Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why. AutoSys captures standard error and standard output in the files named by the job's std_err_file and std_out_file attributes. Always check both: sometimes the script succeeds but writes errors to stdout that don't affect the exit code.
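One way to locate both logs is to read the job definition and pull out the std_out_file and std_err_file attributes. The sketch below assumes the definition text was obtained with something like autorep -J etl_extract -q, that the paths are literal (no variables or redirection prefixes), and that the files are readable from where the script runs, which is often only true on the agent machine. SAMPLE_JIL and the function names are hypothetical.

```python
import re
from collections import deque

# Matches lines such as:  std_err_file: "/var/log/etl/extract.err"
ATTR = re.compile(
    r'^[ \t]*(std_out_file|std_err_file):[ \t]*"?([^"\n]+?)"?[ \t]*$',
    re.MULTILINE,
)

# Hypothetical job definition text for illustration only.
SAMPLE_JIL = """\
insert_job: etl_extract   job_type: CMD
command: /opt/etl/bin/extract.sh
machine: etlhost01
std_out_file: "/var/log/etl/extract.out"
std_err_file: "/var/log/etl/extract.err"
"""


def log_paths(jil: str) -> dict[str, str]:
    """Map each log attribute found in the job definition to its file path."""
    return {name: path for name, path in ATTR.findall(jil)}


def tail(path: str, lines: int = 20) -> list[str]:
    """Return the last `lines` lines of a log file without loading all of it."""
    with open(path, errors="replace") as handle:
        return list(deque(handle, maxlen=lines))


if __name__ == "__main__":
    for name, path in log_paths(SAMPLE_JIL).items():
        print(f"{name}: {path}")
```

Reading the tail of both files covers the case where the script exits 0 but still logs errors to standard output.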
Step 3: Fix and restart correctly
After fixing the root cause, restart the job using the right command. RESTART is the safe option — it clears the failure status and restarts the job with the same conditions (dependencies, machine requirements). FORCE_STARTJOB bypasses all conditions — use it when you need to start a job that is still blocked by unmet dependencies (e.g., after manually fixing data). For box failures, never restart the box without first resetting the failed inner jobs — otherwise the box will immediately re-fail.
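The box recovery order described above (reset the failed inner jobs, then start the box) can be sketched as a plan of sendevent invocations that is printed for review rather than executed. This is a hedged sketch: the job names and the function box_recovery_plan are hypothetical, and only the forced path is shown because the text does not give the exact invocation for a plain restart.

```python
import shlex


def box_recovery_plan(box: str, failed_inner_jobs: list[str]) -> list[list[str]]:
    """Build the command sequence: reset failed inner jobs to INACTIVE, then force-start the box."""
    plan = [
        ["sendevent", "-E", "CHANGE_STATUS", "-s", "INACTIVE", "-J", job]
        for job in failed_inner_jobs
    ]
    # FORCE_STARTJOB ignores starting conditions, so it goes last and only once.
    plan.append(["sendevent", "-E", "FORCE_STARTJOB", "-J", box])
    return plan


def render(plan: list[list[str]]) -> str:
    """Render the plan as shell-quoted lines an operator can review before running."""
    return "\n".join(shlex.join(argv) for argv in plan)


if __name__ == "__main__":
    # Hypothetical box and inner job names.
    print(render(box_recovery_plan("etl_box", ["etl_extract", "etl_transform"])))
```

Producing the commands as data first keeps the forced start a deliberate, reviewable action instead of a reflex.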
Step 4: Verify recovery — beyond the SUCCESS status
A job showing SUCCESS in AutoSys means its exit code was 0. It does not mean the data is correct or the downstream chain completed. Verification is the step that separates reliable operators from ones who get called back an hour later. Check that all downstream jobs in the chain succeeded, and then validate the actual output: file size, row count, or a simple data check.
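The output validation can be as simple as a file size, row count, and freshness check. The following is a minimal sketch assuming the job writes a CSV file with a header row; the function name verify_output, its thresholds, and the modified_after parameter (an epoch timestamp such as the restart time) are assumptions chosen for illustration.

```python
import csv
from pathlib import Path


def verify_output(path: str, min_bytes: int, min_rows: int, modified_after: float) -> list[str]:
    """Return a list of problems found with the output file; an empty list means it passed."""
    target = Path(path)
    if not target.is_file():
        return [f"{target} is missing"]

    problems = []
    stat = target.stat()
    if stat.st_size < min_bytes:
        problems.append(f"size {stat.st_size} bytes is below {min_bytes}")
    if stat.st_mtime < modified_after:
        # A file older than the restart suggests the job reported SUCCESS on stale data.
        problems.append("file is stale: last modified before the expected time")

    with target.open(newline="") as handle:
        # Subtract the header row; never report a negative count for an empty file.
        rows = max(sum(1 for _ in csv.reader(handle)) - 1, 0)
    if rows < min_rows:
        problems.append(f"row count {rows} is below {min_rows}")
    return problems
```

For example, calling verify_output with the restart time as modified_after flags a file left over from an earlier run even though the job status shows SUCCESS.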
Step 5: Post-recovery actions — document and escalate
Recovery isn't complete until you've recorded what happened and why. This step prevents recurring failures and helps your team learn. Write a brief incident summary: what failed, root cause, fix applied, time to recover, and any follow-up needed (e.g., increase term_run_time, add a healthcheck to the database monitoring). If the failure indicates a systemic problem (e.g., multiple jobs using the same broken dependency), escalate to the appropriate team. Finally, update any runbooks or monitoring rules that could have caught the issue earlier.
The Silent Restart That Cost a Trading Window
- Always read the error log before any restart — the root cause is rarely the script.
- Know the downstream dependency chain and protect it from partial data.
- A forced restart without diagnosis is gambling with production.
Common mistakes to avoid
- Restarting without reading the error log
- Restarting a job inside a box without considering the box's state
- Not verifying actual output after recovery
- Restarting the entire box when only one inner job failed
Interview Questions on This Topic
Walk me through how you would handle a failed AutoSys job in production.