AutoSys Restart Failures — Diagnosis Before Recovery
A blind restart after an ETL failure caused a downstream job to process stale data.
20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.
- Acute triage: check scope (autorep -J % -s FA) before touching any job
- Always read std_err/std_out logs before restart — 90% of failures are environment, not script
- RESTART is clean; FORCE_STARTJOB bypasses conditions — use with intent
- Box failures require resetting inner jobs to INACTIVE before forced restart
- Verify downstream chain completes — SUCCESS != correct data
- Recurring failures at same step = root cause analysis, not another restart
When an AutoSys job fails at 3 AM, you need a clear playbook: find out why it failed, fix the issue, restart correctly, and verify recovery. This article is that playbook.
Job failures in AutoSys are inevitable. What separates good operators from great ones is the speed and thoroughness of their recovery process. This article walks through a complete failure handling workflow — from initial alert to verified recovery — the way experienced AutoSys admins actually do it in production.
Why an AutoSys Restart Is Not a Retry Button
An AutoSys restart is a deterministic recovery mechanism: it re-runs a failed job from its defined start point, not from where it crashed. The core mechanic: when a job exits with a non-zero status, AutoSys marks it FAILURE and checks the job's retry settings to decide whether to trigger a restart. Runtime attributes such as 'term_run_time' and 'max_run_alarm' control how long a job may run and when an alarm is raised; they do not by themselves cause a restart. This is not a simple retry. It is a stateful decision that respects job dependencies, box hierarchies, and global conditions.
In practice, the restart behavior is governed by a retry count and a retry interval, written here as 'max_retry' and 'retry_interval' (in stock JIL the per-job retry count is the 'n_retrys' attribute, so check the attribute names against your release). If max_retry is 3 and retry_interval is 60 seconds, AutoSys attempts to restart the job up to 3 times, waiting 60 seconds between attempts. Critically, a restart does not reset the job's exit code history: the job's status moves from FAILURE to RESTART, and the exit code from the failed run persists in the job report. Downstream jobs that depend on the failed job's exit code may therefore still see the failure unless it is explicitly handled.
Use this mechanism when a job fails due to transient conditions — network timeouts, resource contention, or temporary file locks — but not for logic errors or data corruption. In real systems, misconfigured restart policies cause cascading failures: a job that fails due to a missing file will keep restarting, consuming resources and delaying manual intervention. The restart is a tactical recovery tool, not a substitute for root cause analysis.
Step 1: Triage — identify the failure
When paged, your first job is to understand the scope before touching anything. Don't jump into the first failed job — check how many others are down. A single failure is a script or dependency issue. A cascade points to an environment problem — machine, database, network, or upstream job chain.
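A minimal sketch of that scope check, assuming the default `autorep` report layout in which the two-letter status (`FA` for FAILURE) appears as its own whitespace-delimited column. The function names and the cascade threshold of 3 are illustrative choices, not AutoSys features.

```shell
# Count report lines whose status column is FA (FAILURE).
# Reads an autorep-style report on stdin, so it can be tried without a scheduler.
count_failed() {
  # grep -c exits non-zero when the count is 0; "|| true" keeps the pipeline usable
  grep -cE '[[:space:]]FA[[:space:]]' || true
}

# Classify the incident from the failure count.
# The default threshold (3) is an arbitrary illustration; tune it for your estate.
classify_scope() {
  local failed="$1" threshold="${2:-3}"
  if [ "$failed" -eq 0 ]; then
    echo "none"
  elif [ "$failed" -lt "$threshold" ]; then
    echo "isolated"
  else
    echo "cascade"
  fi
}
```

On a host with the AutoSys client installed, something like `classify_scope "$(autorep -J ALL | count_failed)"` would print `isolated` or `cascade`. Matching on the status column rather than a fixed field position avoids breaking when the Last End column is empty.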
Step 2: Diagnose — read the error log
Never restart a job without reading the error log first. You'll restart it into the same failure and wonder why. AutoSys captures standard error and standard output. Always check both — sometimes the script succeeds but reports errors to stdout that don't affect exit code.
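One way to find both logs is to read the `std_err_file` and `std_out_file` attributes from the job definition that `autorep -J <job> -q` prints. The sketch below assumes each attribute sits on its own line, that the paths contain no unexpanded variables, and that the log files are readable from the host where you run it (they normally live on the agent machine). `jil_attr` and `show_job_logs` are hypothetical helper names.

```shell
# Print the value of a JIL attribute (e.g. std_err_file) from a job
# definition read on stdin, with surrounding double quotes removed.
jil_attr() {
  local attr="$1"
  sed -n "s/^[[:space:]]*${attr}:[[:space:]]*//p" \
    | head -n 1 \
    | sed -e 's/^"//' -e 's/"[[:space:]]*$//'
}

# Show the tail of both logs for a job. Needs the AutoSys client on PATH.
show_job_logs() {
  local job="$1" def path attr
  def="$(autorep -J "$job" -q)" || return 1
  for attr in std_err_file std_out_file; do
    path="$(printf '%s\n' "$def" | jil_attr "$attr")"
    if [ -z "$path" ]; then
      echo "$attr not defined for $job" >&2
      continue
    fi
    echo "== $attr: $path =="
    tail -n 50 "$path"
  done
}
```

Reading stderr first and stdout second mirrors the advice above: the exit code tells you that something failed, the two logs together tell you what.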
Step 3: Fix and restart correctly
After fixing the root cause, restart the job using the right command. RESTART is the safe option — it clears the failure status and restarts the job with the same conditions (dependencies, machine requirements). FORCE_STARTJOB bypasses all conditions — use it when you need to start a job that is still blocked by unmet dependencies (e.g., after manually fixing data). For box failures, never restart the box without first resetting the failed inner jobs — otherwise the box will immediately re-fail.
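The sketch below prints the `sendevent` commands for each case instead of running them, so the plan can be reviewed before anything touches production. Mapping the condition-respecting restart to the `STARTJOB` event is an assumption on my part; `FORCE_STARTJOB` and `CHANGE_STATUS -s INACTIVE` are standard `sendevent` usage. The helper names and job names are hypothetical.

```shell
# Print (do not run) the sendevent command for a restart.
# "normal" honours starting conditions; "force" bypasses them.
restart_cmd() {
  local mode="$1" job="$2"
  case "$mode" in
    normal) echo "sendevent -E STARTJOB -J $job" ;;
    force)  echo "sendevent -E FORCE_STARTJOB -J $job" ;;
    *)      echo "unknown mode: $mode" >&2; return 2 ;;
  esac
}

# Print the sequence for a failed box: reset each failed inner job to
# INACTIVE first, then force-start the box itself.
box_restart_plan() {
  local box="$1" job
  shift
  for job in "$@"; do
    echo "sendevent -E CHANGE_STATUS -s INACTIVE -J $job"
  done
  restart_cmd force "$box"
}
```

For example, `box_restart_plan nightly_box etl_extract etl_load` prints two reset commands followed by the forced start of `nightly_box`. Once the output looks right, it can be run line by line or piped to `sh`.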
Step 4: Verify recovery — beyond the SUCCESS status
A job showing SUCCESS in AutoSys means its exit code was 0. It does not mean the data is correct or the downstream chain completed. Verification is the step that separates reliable operators from ones who get called back an hour later. Check that all downstream jobs in the chain succeeded, and then validate the actual output: file size, row count, or a simple data check.
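A simple output check might look like the following sketch. It assumes the job writes a line-oriented file, and it adds a freshness test because a stale file is one of the failure modes this article is about. The function name, the default limits, and the use of `find -mmin` (available in GNU and BSD find, not strictly POSIX) are my choices.

```shell
# Succeed only if the file exists, is non-empty, has at least min_rows
# lines, and was modified within the last max_age_min minutes.
validate_output() {
  local file="$1" min_rows="${2:-1}" max_age_min="${3:-1440}" rows
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  rows="$(wc -l < "$file" | tr -d '[:space:]')"
  if [ "$rows" -lt "$min_rows" ]; then
    echo "only $rows rows in $file, expected at least $min_rows" >&2
    return 1
  fi
  # find prints the path only if it was modified inside the window
  if [ -z "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]; then
    echo "stale: $file is older than $max_age_min minutes" >&2
    return 1
  fi
  echo "ok: $file has $rows rows"
}
```

A call such as `validate_output /data/out/daily_extract.csv 1000 120` (path and limits are placeholders) fails loudly when the file is short or left over from a previous run, which is exactly the case a SUCCESS status hides.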
Step 5: Post-recovery actions — document and escalate
Recovery isn't complete until you've recorded what happened and why. This step prevents recurring failures and helps your team learn. Write a brief incident summary: what failed, root cause, fix applied, time to recover, and any follow-up needed (e.g., increase term_run_time, add a healthcheck to the database monitoring). If the failure indicates a systemic problem (e.g., multiple jobs using the same broken dependency), escalate to the appropriate team. Finally, update any runbooks or monitoring rules that could have caught the issue earlier.
The 3AM Blind Spot: Why a SUCCESS Exit Code Can Lie to You
You fixed the job. Restarted it. Exit code 0. Good night, right? Wrong. A SUCCESS status only tells you the shell command didn't crash. It does not tell you the data landed in the right table, or that the file transfer completed without corruption, or that your downstream job didn't consume garbage.
In production, I've seen AutoSys jobs exit cleanly while writing to a full disk, hitting a stale symlink, or processing yesterday's snapshot instead of today's. The OS says it's fine. AutoSys says it's fine. But your business logic just silently robbed a bank.
The fix: always add a post-success validation step to the job. Use validate_cmd if your release provides it (confirm the attribute name in its JIL reference), or chain a lightweight verification script that checks checksums, row counts, or API responses. Never trust the exit code alone. Trust your validation layer.
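The wrapper route can be sketched as follows. The idea is that the scheduler only sees the wrapper's exit code, so a failed check turns an apparently clean run into a FAILURE. The function name and the exit code 90 are arbitrary choices for illustration.

```shell
# Run the job command, then a validation command.
# Returns the job's own exit code if the job fails, 90 if the job
# exits 0 but validation fails, and 0 only when both succeed.
run_with_validation() {
  local job_cmd="$1" check_cmd="$2" rc=0
  sh -c "$job_cmd" || rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "job failed with exit code $rc" >&2
    return "$rc"
  fi
  if ! sh -c "$check_cmd"; then
    echo "job exited 0 but validation failed" >&2
    return 90
  fi
  return 0
}
```

In a job definition, the `command` attribute would then point at a small script that calls `run_with_validation` with the real job command and a check such as a row-count query. Keeping the validation code distinct from the job's own codes makes it obvious in the job report which layer failed.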
The AutoSys Auto-Restart Trap: When "Retry" Is Sabotage
Every junior's first instinct: enable auto-retry in the job definition so the system handles it. Stop. Auto-retry is not a recovery strategy. It is a bandage that masks transient failures and turns permanent ones into infinite loops.
Here's the rule: auto-retry only for infrastructure hiccups — network timeouts, disk pressure, DNS failures. Never for logic errors, data corruption, or missing dependencies. If your job fails because the input file is malformed, retrying it 47 times will never fix the file. It just screams louder.
Configure auto-retry with max_retry: 3 and a backoff formula that doubles the interval. Treat the retry count as an alert signal: if a job reaches retry #3, don't restart it again. Escalate. Write a wrapper that exits with a dedicated code for logic failures (e.g., 127, although the shell itself returns 127 for "command not found", so a code your tooling never emits on its own is safer), then set failure_exit_codes: 127 to disable auto-retry for those cases.
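Both ideas fit in a few lines of shell. This is a sketch under stated assumptions: the wrapper reports 99 for non-retryable failures (chosen instead of 127 to avoid the shell's own command-not-found code), and it treats the sysexits-style codes 64, 65 and 66 (usage error, bad input data, missing input) as logic or data errors. Your scripts' conventions will differ, so adjust the mapping.

```shell
# Print the wait before each retry, doubling from a base interval.
backoff_schedule() {
  local base="$1" retries="$2" i=1 delay out=""
  delay="$base"
  while [ "$i" -le "$retries" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    i=$((i + 1))
  done
  echo "$out"
}

# Map a script's raw exit code to the code reported to the scheduler.
classify_exit() {
  local rc="$1"
  case "$rc" in
    0)        echo 0 ;;   # success
    64|65|66) echo 99 ;;  # logic or data error: must not be retried
    *)        echo 1 ;;   # anything else is treated as transient
  esac
}
```

With a base of 60 seconds and three retries, `backoff_schedule 60 3` prints `60 120 240`. A wrapper would run the real script, pass `$?` to `classify_exit`, and exit with the result, so that retrying is only ever attempted for the transient class.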
The Silent Restart That Cost a Trading Window
- Always read the error log before any restart — the root cause is rarely the script.
- Know the downstream dependency chain and protect it from partial data.
- A forced restart without diagnosis is gambling with production.
autorep -J % -s FA | wc -l
autorep -J % -s FA | head -20
Key takeaways
Common mistakes to avoid
- Restarting without reading the error log
- Restarting a job inside a box without considering the box's state
- Not verifying actual output after recovery
- Restarting the entire box when only one inner job failed
Interview Questions on This Topic
Walk me through how you would handle a failed AutoSys job in production.