Senior 9 min · March 19, 2026
AutoSys Fault Tolerance and Recovery

AutoSys box_terminator — Prevent Silent Validation Fails

A failed validation job let downstream jobs run on corrupt data — $2.4M payroll error.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.

Last updated: May 24, 2026
Quick Answer
  • n_retrys: auto-retry a failed job up to N times for transient failures
  • box_terminator: stop the entire box when a critical job fails
  • Dual Event Server HA: infrastructure-level failover for the AutoSys scheduler
  • alarm_if_fail: notify only after all retries exhausted — don't wake ops for blips
  • term_run_time: kill hung jobs so downstream isn't blocked indefinitely
  • Biggest mistake: setting n_retrys too high masks permanent failures and delays escalation by hours
Definition
What is AutoSys Fault Tolerance and Recovery?

AutoSys box_terminator is a job attribute that forces an entire box (container) to immediately fail when a specific job within it fails, preventing the silent continuation of downstream jobs that depend on that failed job's output. Without box_terminator, a job failure inside a box often goes unnoticed because the box itself may still report as running or succeeding, leading to corrupted data pipelines, incomplete ETL processes, or unreported outages.

This attribute exists because a box, by default, does not stop when one of its jobs fails: it tracks the status of its jobs but does not halt the others. You set box_terminator: 1 on a critical job inside a box, and when that job fails, the box is terminated at once, triggering any alarm_if_fail or notification logic you've configured.

This is essential for fault-tolerant scheduling where a single validation step (e.g., file arrival check, data quality gate) must stop all subsequent processing rather than letting the pipeline run on stale or missing data. Alternatives include using job dependencies outside boxes or custom exit code handling, but box_terminator is the simplest native mechanism for fail-fast behavior in complex job networks.

Do not use it for non-critical jobs where you want the box to continue despite a failure — that's what n_retrys and conditional logic are for.

Plain-English First

Fault tolerance in AutoSys is like building redundancy into your plans. If the main road is blocked (job fails), you want automatic detours (retries), emergency alerts (alarms), and a backup plan (recovery jobs). Good fault tolerance means problems get handled automatically at 3 AM without waking anyone up.

AutoSys fault tolerance recovery is the safety net that catches batch jobs when your primary strategy fails. Without it, a single failed dependency can silently corrupt a month-end report or leave a critical data feed in an inconsistent state. Developers need it because AutoSys jobs don't restart themselves, and your monitoring dashboard won't tell you when a job finished but produced garbage—only explicit recovery patterns like box_terminator, n_retrys, and alarm_if_fail prevent those silent validation failures from becoming production fires.

Why AutoSys box_terminator Exists — Stop Silent Validation Fails

box_terminator is a job attribute you set on a job inside a box. When that job fails, the scheduler terminates the parent box instead of letting the rest of the box carry on. Without it, a failed job does not stop its siblings: any job in the box whose starting conditions are still met runs anyway, and the box only settles its final status once its jobs have completed. With a custom box_success condition the box can even report SUCCESS while a critical step inside it has failed. This is the core mechanic: box_terminator: 1 on the critical job turns its failure into an immediate stop for the whole box, preventing silent data corruption or downstream triggers based on a false positive.

In practice, when a job marked box_terminator: 1 ends in FAILURE, AutoSys terminates the box and no further jobs in it are started. Whether jobs that are already running get killed, and with which signals, depends on the release, the agent configuration, and attributes such as job_terminator on those jobs, so confirm the behaviour in a non-production instance. Key properties: it takes 1/y or 0/n (default 0), it is set on the child job rather than on the box itself, and it is a hard stop rather than a graceful one. Any cleanup logic in jobs that are killed or never start is lost, so design for that.

Set box_terminator: 1 on the gate job of any box where a single failure invalidates the entire batch. A typical case is an ETL pipeline where a failed extract step means the load step would process stale or partial data. It matters because without it you get silent validation failures: the remaining jobs run and report SUCCESS, downstream consumers trust that output, and data quality issues surface hours later in production reports. Real systems use this to enforce transactional boundaries across job steps.

Not a Graceful Shutdown
A box_terminator failure stops the box abruptly; jobs that are killed or never start get no chance to clean up. Design jobs in the box to be idempotent or to handle forced termination.
Production Insight
ETL pipeline where the extract step fails but the load step runs on partial data: the load reports SUCCESS and downstream dashboards show wrong numbers.
Symptom: processing jobs show SUCCESS but data in the target table is incomplete or corrupted, and no one acts on the single failed gate job.
Rule: always set box_terminator: 1 on the gate job of any box where that job's failure invalidates the entire batch. Never rely on status aggregation alone.
Key Takeaway
box_terminator: 1 on a critical job stops the whole box the moment that job fails, so later jobs cannot run on partial results.
Default is 0: the rest of the box keeps going when a job fails. Set it explicitly on the gate jobs of critical workflows.
The stop is abrupt, not graceful. Design for jobs being killed or skipped mid-batch.
[Figure: Fault Tolerance Layers (Job · Box · Infrastructure). Job level: n_retrys (auto-retry, 2 for transient failures), term_run_time (kill hung jobs), alarm_if_fail (alert team). Box level: box_terminator (kill box), validation job as terminator, success() chain control, done() for cleanup jobs. Infrastructure: Dual Event Server HA, shadow auto-promotes, tie-breaker arbitrates, agent redundancy planning. Source: thecodeforge.io]

Automatic retry with n_retrys

The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.

Here's the thing: each retry is a full new attempt — the job script runs again from scratch. AutoSys doesn't resume from where it left off. So if your job is not idempotent, retries can cause data duplication or corruption. For example, an INSERT without a uniqueness check will happily create duplicate rows on each retry. Make sure your scripts handle re-entry safely: use idempotency keys, checkpoints, or database MERGE (upsert) logic.
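The re-entry problem above can be sketched with an in-memory SQLite table. This is a minimal illustration, not the article's own loader: the `trades` table, the `trade_id` key, and the `load_trades` function are hypothetical names, and `INSERT OR REPLACE` stands in for whatever MERGE or upsert your database offers.

```python
import sqlite3


def load_trades(conn, rows):
    """Write rows keyed on trade_id so that a retry leaves one copy of each.

    Hypothetical loader: table and column names are assumptions for the example.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trades ("
        "trade_id TEXT PRIMARY KEY, amount REAL)"
    )
    # INSERT OR REPLACE overwrites the existing row for a key instead of
    # adding a second one, which is what makes a full rerun safe.
    conn.executemany(
        "INSERT OR REPLACE INTO trades (trade_id, amount) VALUES (?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    batch = [("T1", 10.0), ("T2", 20.0)]
    print(load_trades(conn, batch))  # first attempt
    print(load_trades(conn, batch))  # simulated retry: row count is unchanged
```

A plain `INSERT` in the same position would fail or duplicate on the second call, which is the behaviour a retry would trigger in production.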

When setting n_retrys, choose a number that matches the expected transient window. If network blips last ~30 seconds, and your job runs in 2 minutes, n_retrys: 3 gives about 6 minutes of recovery time. That's enough for most intermittent issues without delaying the pipeline too much.
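The sizing arithmetic can be written down as a small helper. This is a sketch under simple assumptions: every attempt takes the same time, every attempt fails, and `wait_minutes` is a hypothetical parameter for whatever restart delay your scheduler applies between attempts (0 if none).

```python
def retry_window_minutes(n_retrys, run_minutes, wait_minutes=0):
    """Minutes spent on retries alone, after the first attempt has failed."""
    return n_retrys * (run_minutes + wait_minutes)


def minutes_until_alarm(n_retrys, run_minutes, wait_minutes=0):
    """Worst case from first start to final FAILURE when every attempt fails."""
    return run_minutes + retry_window_minutes(n_retrys, run_minutes, wait_minutes)


if __name__ == "__main__":
    # The 2-minute job with n_retrys: 3 from the paragraph above.
    print(retry_window_minutes(3, 2))   # 6
    print(minutes_until_alarm(3, 2))    # 8
```

The second number is the one that matters for escalation: it is how long a permanent failure stays invisible before the job is finally marked FAILURE.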

retry_config.jil
insert_job: extract_market_data
job_type: CMD
command: /scripts/extract_market.sh
machine: data-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "18:00"
n_retrys: 3            /* retry up to 3 times after initial failure = 4 total attempts */
alarm_if_fail: 1       /* alarm only after all retries exhausted */
term_run_time: 45      /* kill if running over 45 minutes */
std_err_file: /logs/autosys/extract_market_data.err
Output
/* Execution sequence on failure:
18:00:01 — Attempt 1: FAILURE (exit code 1)
18:00:31 — Retry 1: FAILURE (exit code 1)
18:01:01 — Retry 2: SUCCESS (exit code 0)
18:01:01 — extract_market_data: SUCCESS — downstream jobs proceed */
Idempotency is not optional
If your script inserts data, writes to a file, or sends an API call, each retry repeats that action. Without idempotency, you'll get duplicate rows, corrupt files, or duplicate charges. Test your script's behaviour under retry before putting it in production.
Production Insight
n_retrys masks flaky scripts. If a job fails intermittently and retry succeeds, you never investigate the root cause — until the underlying issue grows worse.
A job that always fails on retry 3 may indicate resource exhaustion (temp tablespace, file handles) that only triggers under load.
Rule: set a maximum of 3 retries and monitor retry counts via AutoSys reports. A job that retries every night needs investigation, not tolerance.
Key Takeaway
n_retrys is for transient failures, not buggy scripts.
Set 2–3 retries max and pair with idempotent job logic.
If a job uses all retries, treat it as an incident — not a new baseline.

box_terminator — stopping the box on critical failure

In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. If one job's failure should stop everything — because its output is required or its failure invalidates all subsequent work — mark it as a box_terminator.

When a job with box_terminator: 1 fails, AutoSys immediately terminates the parent box (the box typically ends in TERMINATED rather than running to completion). Inner jobs that have not started yet do not run, and jobs that are already running can be stopped as well, depending on release and on their job_terminator setting. This prevents wasted compute on bad data and reduces the time to detect and recover.

In practice, use box_terminator on:
  • Data validation jobs (schema checks, referential integrity)
  • Prerequisite extraction jobs (if upstream source is unavailable)
  • Configuration or lookup table loads (everything depends on them)

Do not use box_terminator on jobs that have graceful degradation paths. If a downstream job can handle missing data (e.g., produce a partial report with a warning), let it run.
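A gate job is only useful if its script actually exits non-zero on bad input. The following is a minimal sketch of such a validation script in Python; the required columns (`employee_id`, `amount`) and the function names are assumptions for illustration, not part of any AutoSys API.

```python
import csv
import sys

# Hypothetical schema for the example: both columns must be present and non-empty.
REQUIRED = ("employee_id", "amount")


def validate(rows):
    """Return a list of problems; an empty list means the input is fit to process."""
    if not rows:
        return ["no rows"]
    problems = []
    seen = set()
    for n, row in enumerate(rows, start=1):
        for col in REQUIRED:
            if not (row.get(col) or "").strip():
                problems.append(f"row {n}: missing {col}")
        emp = row.get("employee_id")
        if emp:
            if emp in seen:
                problems.append(f"row {n}: duplicate employee_id {emp}")
            seen.add(emp)
    return problems


def exit_code(problems):
    """0 lets the box continue; any other value makes the gate job fail."""
    return 1 if problems else 0


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], newline="") as fh:
        found = validate(list(csv.DictReader(fh)))
    for line in found:
        print(line, file=sys.stderr)
    sys.exit(exit_code(found))
```

With box_terminator: 1 on the job that runs this script, a single bad row stops the box before any processing job touches the file.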

box_terminator.jil
insert_job: validate_input_data
job_type: CMD
box_name: eod_box
command: /scripts/validate.sh
machine: server01
owner: batch
box_terminator: 1          /* if this fails, the entire box fails immediately */
alarm_if_fail: 1

/* Without box_terminator: other jobs in the box continue even after validate fails */
/* With box_terminator: the box is terminated at once and inner jobs that have not started do not run */
Put validation jobs as box_terminators
Data validation jobs are ideal box_terminator candidates. If input data is invalid, there's no point running any of the downstream processing jobs — they'd produce bad output. Mark the validation job as box_terminator: 1 to stop the entire box immediately on validation failure.
Production Insight
A validation job that is not a box_terminator allows downstream jobs to run on garbage data. The result: corrupt output that passes all success checks.
Debugging that scenario is brutal — every downstream job shows SUCCESS, but the data is wrong. You lose hours tracing the problem back.
Rule: if the job's output is a hard prerequisite for everything that follows, it must be a box_terminator. No exceptions.
Key Takeaway
box_terminator stops the entire box on failure.
Use it on validation, extraction, and configuration jobs — anything whose failure makes downstream work worthless.
Without it, the rest of the box keeps running after a critical job inside it fails.

alarm_if_fail and notification — when to wake someone up

alarm_if_fail:1 tells AutoSys to trigger an alarm when a job fails. But the timing matters: if you also have n_retrys > 0, the alarm only fires after all retries are exhausted. That's the right behaviour for transient failures — you don't want the on-call engineer paged for a 30-second network glitch.

However, some jobs should always alarm on the first failure, regardless of retries. For those, consider moving the retry logic into a wrapper script and keeping the AutoSys job at alarm_if_fail: 1 with n_retrys: 0, so the first failure AutoSys sees is already final. Or use a separate notification mechanism: a custom script that sends a page whenever the exit code is non-zero.

In AutoSys, the alarm mechanism is typically configured in WCC or via an external event handler. The job attribute alarm_if_fail sets a flag that AutoSys propagates to the event server. Make sure your notification system (email, SMS, PagerDuty) is subscribed to these events. Many teams set up automated alerting rules that trigger on job status FAILURE, but if those rules don't respect the retry state, they may fire on every transient blip.

alarm_config.jil
insert_job: critical_report_generation
job_type: CMD
box_name: eod_box
command: /scripts/generate_report.sh
machine: server01
owner: batch
n_retrys: 0              /* no retries — always alarm on first failure */
alarm_if_fail: 1          /* alarm immediately */

/* Alternative: the job retries, but you still want a page on the first failure */
/* Use a wrapper script that retries internally and pages on a non-zero exit */
insert_job: critical_workflow
job_type: CMD
command: /scripts/run_and_page.sh   /* wrapper does retry logic internally */
machine: server01
owner: batch
n_retrys: 0
alarm_if_fail: 1
Don't overlook the retry → alarm delay
If a job has n_retrys: 3 and alarm_if_fail: 1, an engineer won't be notified until all 3 retries have failed. If each attempt takes 5 minutes, that's 15 minutes of retries on top of the 5-minute first attempt, so 20 minutes from start to alarm. For critical jobs, that may be too long. Consider reducing retries or using custom alerting.
Production Insight
Many teams assume alarm_if_fail fires instantly on job failure. But combined with n_retrys, it fires only after all retries are exhausted — which can be minutes or hours later.
We've seen a payroll job with n_retrys:10 (yes, 10) and alarm_if_fail:1. It retried for 40 minutes before alarming. Production data was delayed by 40 minutes because of a simple missing file that could have been detected instantly.
Rule: for time-sensitive jobs, either reduce retries or build a separate health-check job that alarms if the main job hasn't completed within a window.
Key Takeaway
alarm_if_fail only fires after all retries are used.
For jobs that need immediate attention, set n_retrys:0 or use custom notification.
Test your alerting pipeline — verify alarms reach the right people.
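The separate health-check idea from the rule above can be sketched as a deadline test. This is an illustration only: how you obtain the start and end times (for example by parsing an autorep report) is left out, and `is_overdue` is a hypothetical helper name.

```python
from datetime import datetime, timedelta


def is_overdue(started_at, finished_at, now, window):
    """True when the job has not finished and has exceeded its allowed window.

    started_at / now are datetimes; finished_at is None while the job is
    still running or retrying; window is a timedelta.
    """
    if finished_at is not None:
        return False
    return now - started_at > window


if __name__ == "__main__":
    start = datetime(2026, 1, 1, 18, 0)
    # Still retrying 21 minutes after an 18:00 start, with a 20-minute budget.
    print(is_overdue(start, None, datetime(2026, 1, 1, 18, 21), timedelta(minutes=20)))
```

A watchdog job that exits non-zero when this returns True, defined with alarm_if_fail: 1 and n_retrys: 0, pages on the deadline no matter how many retries the main job still has left.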

HA architecture for fault tolerance

At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.

The setup has two halves that are easy to conflate. The Event Server is the AutoSys database; in dual Event Server mode, events are written to two synchronized databases so that the loss of one does not stop processing. The scheduler (the Event Processor) is protected separately by a shadow scheduler on a second machine, with a tie-breaker scheduler on a third machine to arbitrate between them. The shadow monitors the primary via a heartbeat. If the primary becomes unreachable, the shadow promotes itself to active after a configurable timeout, typically a matter of minutes.

Important: this is not an active-active cluster. Only one scheduler runs jobs at a time. The shadow is a passive standby: its process is up and watching, but it does not process jobs while the primary is healthy.

Failover is automatic, but not instantaneous. During the promotion period, no jobs are scheduled and no events are processed. If the failover happens during a critical window, that gap can cause SLAs to be missed. Schedule failover tests inside maintenance windows, away from critical batch.

ha_check.sh
# Report Event Server (database) connectivity and scheduler status.
# In an HA install this covers the primary, shadow and tie-breaker schedulers.
chk_auto_up

# Show release, database and host details for this instance
autoflags -a

# Option switches for chk_auto_up differ between releases; check its usage
# text on your install before scripting against specific flags.
Output (illustrative summary, not literal command output)
AutoSys Instance: ACE
Scheduler Role: PRIMARY (active)
Shadow Status: IN_SYNC
Event Processor: RUNNING
Think of HA like a co-pilot
  • The primary scheduler actively schedules and runs all jobs.
  • The shadow scheduler watches the primary's heartbeat; it does not schedule jobs.
  • On failure, the shadow promotes itself, connects to the Event Server database, and starts processing.
  • The database and any shared storage must be highly available themselves: if they go down, failover fails.
  • Failover takes time (typically 1–5 minutes), and jobs scheduled in that window are delayed.
Production Insight
HA failover is only as reliable as the database, storage and network underneath it. We've seen cases where a hung NFS mount left both primary and shadow assuming the other was dead: a split-brain scenario.
Split-brain protection comes from the tie-breaker scheduler, so it only works if the tie-breaker is installed and reachable from both sides. If both instances think they're primary, you'll get duplicate job executions and database corruption.
Rule: put shared components on highly available storage, run the tie-breaker on a third machine, and test failover quarterly to ensure the shadow is in sync and promotion works cleanly.
Key Takeaway
Dual Event Server HA protects against AutoSys server failure.
The shadow is a passive standby, not active-active.
Test failover regularly. Ensure shared file system is also HA. Without testing, the HA setup is just false confidence.

Recovery jobs and manual intervention patterns

Even with automatic retries and box_terminators, some failures require human intervention. Recovery jobs are specially designed jobs that repair the state after a failure and allow the pipeline to resume from a clean point.

Common recovery patterns
  • Rollback jobs: Reverse the effects of a partially completed batch (e.g., delete inserted rows, restore files from backup).
  • Re-run jobs: A job that reinitialises the pipeline after a failure — often a wrapper that truncates and re-imports data.
  • Compensation jobs: Run after a failure to fix data integrity issues before the next cycle.
  • Manual restart procedures: Documented steps to use sendevent to reset job statuses and re-trigger the box.

When designing recovery, think about idempotency: the recovery job should be safe to run multiple times if the first attempt also fails. Use checkpoints in your scripts: record completion steps in a control table so that rerunning the recovery job doesn't repeat already-completed actions.
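The checkpoint idea can be sketched with a small control table. This is a minimal illustration using SQLite; the `recovery_control` table and the `run_recovery` function are hypothetical, and a real implementation would also key the table on a run date or batch id so that yesterday's checkpoints do not suppress today's recovery.

```python
import sqlite3


def run_recovery(conn, steps):
    """Run each (name, action) pair once, skipping names already recorded as done.

    Returns the names executed in this call. A step is only recorded after
    its action returns, so a crash mid-step leaves it eligible for rerun.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recovery_control (step TEXT PRIMARY KEY)"
    )
    executed = []
    for name, action in steps:
        done = conn.execute(
            "SELECT 1 FROM recovery_control WHERE step = ?", (name,)
        ).fetchone()
        if done:
            continue  # completed in an earlier attempt
        action()
        conn.execute("INSERT INTO recovery_control (step) VALUES (?)", (name,))
        conn.commit()
        executed.append(name)
    return executed
```

If the second of three steps raises, a rerun skips the first step and resumes at the second, which is what makes it safe for ops to trigger the recovery job again.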

A good practice is to separate recovery jobs into their own box with no dependencies on the business-critical timeline. Keep them available for ops to trigger via a JIL override or sendevent.

recovery_job.jil
/* Recovery box: triggers after EOD box failure */
insert_job: daily_recovery
group: recovery
job_type: BOX
/* runs if eod_box fails or is terminated (for example by a box_terminator job) */
condition: failure(eod_box) | terminated(eod_box)
/* manual trigger also works: sendevent -E FORCE_STARTJOB -J daily_recovery */

/* Inside the recovery box */
insert_job: rollback_data
job_type: CMD
box_name: daily_recovery
/* CHECKPOINT and LAST_FAILED_STEP must be present in the job's environment */
command: /scripts/rollback.sh $CHECKPOINT $LAST_FAILED_STEP
machine: server01
owner: batch

insert_job: notify_recovery
job_type: CMD
box_name: daily_recovery
condition: success(rollback_data)
command: /scripts/send_notification.sh "Recovery complete for eod_box"
machine: server01
owner: batch
alarm_if_fail: 1
Make recovery jobs testable in isolation
Don't design recovery jobs that only work when the box is in a specific failure state. Make them able to run standalone with parameters. Use global variables to pass the failure context so ops can trigger exactly what's needed.
Production Insight
Recovery jobs that are not idempotent are dangerous. If a rollback job fails mid-way and runs again, it might try to drop a table that's no longer there.
The worst case we've seen: a compensation job inserted duplicate records because its script didn't check if the fix had already been applied. The ops team ran it three times, each adding more duplicates. By the time they noticed, reconciliation took three days.
Rule: every recovery script must be idempotent. Use a control table to record progress. And never run recovery jobs blindly — log what they do and let humans review before the next cycle.
Key Takeaway
Recovery jobs fix state after failures.
Make them idempotent: running twice should be safe.
Separate recovery into its own box and document manual trigger steps.
Without idempotent recovery, you'll make production problems worse, not better.

Replication Strategies That Don't Lie — Full vs Partial vs Shadow

Competitors will tell you replication is about copying data. That's like saying a parachute is about fabric. You need to know which failure you're surviving before you pick a strategy.

Full replication means every node carries the entire job history and state. It's expensive. It's slow. But when a primary AutoSys agent drops dead, failover is instant. No context loss. The tradeoff is network chatter and storage bloat. Don't use this for ephemeral jobs.

Partial replication is smarter. You replicate job definitions and critical state (like last run timestamp, exit code) to a secondary agent. Active jobs get mirrored in real-time; idle jobs just carry a pointer. This cuts overhead by 60-70% in most shops.

Shadowing (passive replication) keeps a standby agent that receives checkpoint data but never processes. It's a warm spare. When the primary fails, shadowing restores from the last checkpoint — you lose any in-flight work. Fine for batch windows with no mid-job dependencies. Bad for real-time pipelines.

Active replication runs two agents processing the same job stream. Both acknowledge completion. If one drops, the other holds the state. Double the resource burn, zero recovery time. Use it only for jobs where a second of downtime costs a thousand dollars.

Pick your poison based on your recovery time objective, not your budget's comfort.

ReplicationPolicy.yml
# io.thecodeforge devops tutorial
# Illustrative policy for a site-built replication layer; these keys are not native AutoSys settings.

agent_replication:
  primary: "agnt-prd-payroll-01"
  secondary: "agnt-prd-payroll-02"
  strategy: partial
  sync_interval: 30s  # every 30 seconds push state
  sync_on_job_end: true

shadow_agent:
  enabled: true
  passive: true
  failover_on_missed_heartbeat: 2  # misses before takeover
  checkpoint_restore: true
  lost_in_flight: accepted
Output
WARN: Shadow agent will lose state for jobs running during failure.
→ Configure checkpoint interval ≤ 5s for long-running ETLs.
Production Trap:
Partial replication with sync_on_job_end only works if your job completes. A hung job never syncs. Add max_run_alarm or your standbys will be blind.
Key Takeaway
Your replication strategy is a direct bet on how much downtime you can eat. Full = zero, expensive. Partial = seconds, cheap. Shadow = minutes, cheapest. Choose by RTO, not by habit.

Fault Detection and Recovery — What Your Monitoring Dashboard Won't Tell You

Detection isn't an alert. It's a protocol. Most teams set a ping alarm on the AutoSys agent and call it done. That catches a dead box but misses the silent killer: the agent that's alive but stuck in a zombie job loop.

Your detection layer needs three signals: heartbeat from the agent, job execution lag, and agent CPU/memory creep. If the agent's heart beats but it hasn't started a scheduled job in 5 minutes, that's a fault. If it's consuming 90% RAM but completing jobs, that's a degradation — not a failure yet, but you're on the clock.
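The three-signal rule can be expressed as a small classifier. This is a sketch with the thresholds from the paragraph above as defaults; the function name, the state labels, and how each signal is collected are assumptions.

```python
def classify_agent(heartbeat_ok, start_lag_minutes, ram_pct,
                   lag_limit=5, ram_limit=90):
    """Combine heartbeat, execution lag and memory use into one state.

    start_lag_minutes is how long the oldest due job has been waiting to start.
    """
    if not heartbeat_ok:
        return "down"
    if start_lag_minutes >= lag_limit:
        return "fault"      # alive but not starting scheduled work
    if ram_pct >= ram_limit:
        return "degraded"   # still completing jobs, but on the clock
    return "healthy"


if __name__ == "__main__":
    print(classify_agent(True, 7, 40))   # heartbeat fine, jobs not starting
    print(classify_agent(True, 0, 93))   # working, but memory is creeping
```

Checking the signals in this order matters: a stuck agent with high memory is reported as a fault, not merely as degraded.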

Recovery starts the second you detect. Don't wait for human approval. Automate: kill the stuck PID, restart the agent, re-run the orphaned jobs. Use alarm_if_fail to flag only the failures that survive three retries. Everything else is noise.

For recovery jobs, the pattern is simple: a dedicated recovery box that triggers when the main box ends in FAILURE or TERMINATED (for example after a box_terminator job fails) or when a heartbeat is missed. That recovery box calls a shell script to restart the agent and re-queue critical jobs. Log every action. You need the trail when the incident post-mortem asks "who did what."

Manual intervention is the last resort — reserve it for scenarios where automated recovery would corrupt data: incomplete file transfers, partially written database loads. For everything else, automate until it hurts.

AutoRecoveryAgent.jil
/* io.thecodeforge devops tutorial */
/* Assumes a monitoring job publishes job_queue_backlog and agent_cpu as AutoSys global variables */

insert_job: recovery_payroll_agent
job_type: BOX

insert_job: restart_agent
job_type: CMD
box_name: recovery_payroll_agent
/* restart_agent.sh is a hypothetical site script that restarts the agent on the named host */
command: /scripts/restart_agent.sh agnt-prd-payroll-01
machine: server01
condition: v(job_queue_backlog) > 10 & v(agent_cpu) > 80
box_terminator: 1
alarm_if_fail: 1

insert_job: requeue_missed_jobs
job_type: CMD
box_name: recovery_payroll_agent
command: sendevent -E FORCE_STARTJOB -J payroll_etl_01
machine: server01
condition: success(restart_agent) & failure(payroll_etl_01)
Output
2025-04-15 02:14:33 AUTOSYS EVENT: agent_payroll_queue depth = 14
2025-04-15 02:14:34 RECOVERY TRIGGERED: restart_agent → status SUCCESS
2025-04-15 02:14:35 RECOVERY: requeue_missed_jobs triggered
payroll_etl_01 FORCE_STARTJOB sent
Senior Shortcut:
Don't alert on every restart. Alert only on restarts that happen more than 3 times in a rolling hour. That's a pattern, not a hiccup.
Key Takeaway
Detection = heartbeat + execution lag + resource creep. Recovery = automated kill, restart, re-queue. Manual only when data integrity is at stake.
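The "more than 3 restarts in a rolling hour" rule from the shortcut above can be sketched with a sliding window. The class name and the use of plain epoch seconds are assumptions for the example.

```python
from collections import deque


class RestartAlert:
    """Fire only when restarts exceed a threshold inside a rolling window."""

    def __init__(self, threshold=3, window_seconds=3600):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()

    def record(self, ts):
        """Record a restart at ts (seconds); return True if an alert is due."""
        self.events.append(ts)
        # Drop restarts that have aged out of the rolling window.
        while self.events and ts - self.events[0] >= self.window:
            self.events.popleft()
        return len(self.events) > self.threshold


if __name__ == "__main__":
    alert = RestartAlert()
    for t in (0, 600, 1200, 1800):
        print(t, alert.record(t))  # only the fourth restart inside the hour alerts
```

Three restarts spread over an afternoon never trip it; four inside the same hour do.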

Overview

AutoSys fault tolerance is not merely about configuring retries or alarms. It is about designing a system that degrades gracefully under failure. This guide dissects the patterns that prevent silent validation failures and keep batches recoverable without human babysitting. You'll learn why box_terminator exists to stop cascading errors, how n_retrys absorbs transient faults with a hard cap on attempts, and when alarm_if_fail should wake an operator versus stay quiet during a normal retry. We cover the HA architecture that keeps the scheduler and its database available, recovery jobs that repair state with idempotency guards, and replication strategies (full, partial, shadow) that trade cost against recovery time. The goal: move from reactive firefighting to proactive fault isolation. Each topic follows a "WHY before HOW" approach, starting with the failure mode you're trying to avoid, then showing the JIL configuration that addresses it, without assuming your monitoring dashboard will tell you the whole truth.

job_overview_example.jil
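/* io.thecodeforge devops tutorial */
/* minimal box_terminator setup showing the fault boundary; job and box names are placeholders */
insert_job: validate_payment_input
job_type: CMD
box_name: payment_box
command: /scripts/validate.sh
machine: server01
box_terminator: 1      /* final failure here stops the whole box */
n_retrys: 3            /* retried first; the box stops only once retries run out */
alarm_if_fail: 1

/* recovery is a separate job keyed off the box status, not a job attribute */
insert_job: RECOVER_PAYMENT_FLOW
job_type: CMD
command: /scripts/recover_payment.sh
machine: server01
condition: failure(payment_box) | terminated(payment_box)
Output
The validation job is retried up to 3 times. If it still fails, the box is stopped, an alarm fires, and the recovery job becomes eligible to run.
Production Trap:
Without a box_terminator, a gate job can fail for good while downstream jobs sit waiting on a success() condition that will never be met. The box stays RUNNING: an invisible deadlock that no dashboard will flag.
Key Takeaway
Use n_retrys to cap automatic retries and box_terminator to stop the box once they are exhausted, so a failed gate job cannot freeze the dependency graph silently.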

RestController and External API Caller

Your AutoSys jobs often trigger external REST APIs, but a slow or failing API can cause job hangs, retry storms, and stale locks. A dedicated RestController wrapper (in Java or Python) separates the HTTP call logic from the job script, enforcing timeouts, response validation, and structured error codes. For example, a job that calls a payment gateway should use a RestController with a 10-second timeout and a 503 fallback. The job's exit_code maps directly to the RestController's response: 0 for success, 1 for timeout, 2 for invalid response. This turns ambiguous network failures into actionable job statuses. External API callers must also implement idempotency keys to prevent duplicate charges on retry. Combine this with AutoSys n_retrys set to 2 (not infinite) so a flaky API doesn't cascade. The RestController becomes the single fault boundary: if the API is down, your job fails fast instead of hanging until the scheduler kill time.
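The exit-code mapping described above can be sketched with the standard library alone. This is an illustration, not the article's rest_controller.py: the JSON contract (`{"status": "ok"}`), the `call_api` name, and the injectable `opener` parameter (there so the logic can be exercised without a network) are all assumptions.

```python
import json
import socket
import sys
import urllib.error
import urllib.request

EXIT_OK, EXIT_TIMEOUT, EXIT_INVALID = 0, 1, 2


def call_api(url, timeout=10, opener=urllib.request.urlopen):
    """Call url once and translate the outcome into a job exit code."""
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read()
    except (socket.timeout, TimeoutError):
        return EXIT_TIMEOUT
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return EXIT_TIMEOUT
        return EXIT_INVALID  # includes HTTP error statuses such as 503
    if status != 200:
        return EXIT_INVALID
    try:
        payload = json.loads(body)
    except ValueError:
        return EXIT_INVALID
    # Assumed response contract for the example: {"status": "ok"}
    if isinstance(payload, dict) and payload.get("status") == "ok":
        return EXIT_OK
    return EXIT_INVALID


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(call_api(sys.argv[1]))
```

Because the timeout is enforced inside the wrapper, the job ends in seconds with a meaningful code instead of hanging until term_run_time kills it.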

rest_controller_job.jil
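/* io.thecodeforge devops tutorial */
/* RestController job that fails fast on timeout; the job name is a placeholder */
insert_job: call_payment_api
job_type: CMD
command: python3 rest_controller.py --url https://api.payments.com --timeout 10
machine: server01
n_retrys: 2
alarm_if_fail: 1
max_exit_success: 0     /* only exit code 0 counts as success */
term_run_time: 5        /* backstop in case the wrapper itself hangs */
Output
Job exits 0 on success, 1 on timeout, 2 on invalid response. AutoSys treats every non-zero code as a failure and applies n_retrys to all of them, so a wrapper that must not retry hard failures has to make that decision itself.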
Production Trap:
If your job calls an external API without a RestController timeout, a network partition can hold the job for hours—blocking all downstream jobs and masking the real failure.
Key Takeaway
Wrap every external API call in a RestController with a strict timeout; map exit codes to job statuses so AutoSys can differentiate retriable from fatal failures.
Production incident · Post-mortem · Severity: high

The Night the Payroll Box Ran All Weekend

Symptom
Payroll output was $2.4M off. Downstream jobs completed with success but produced incorrect results. No alarms fired.
Assumption
The team assumed the validation job's failure would stop the box because it was a prerequisite. They didn't set box_terminator.
Root cause
Validation job failed (exit code 1), but without box_terminator:1, the box continued running all other jobs. The remaining jobs used stale data and ran successfully on corrupt input.
Fix
Added box_terminator:1 to the validation job. Also added a success() condition on the first processing job after validation so even if box_terminator is accidentally removed, the condition blocks execution.
Key lesson
  • Always mark validation and gate jobs as box_terminators — a failure there means all downstream work is garbage.
  • Alarm on validation failures at severity CRITICAL — not just on jobs that crash but on jobs that invalidate the data pipeline.
  • Never assume a box failure cascade works without explicit attributes — test the failure scenario in a non-prod environment.
Production debug guide
Symptom → Action quick reference for the most common production failures.
Symptom · 01
Job shows FAILURE, no retry attempted
Fix
Check the job definition: autorep -J job_name -q. Verify n_retrys is set. If it is 0 or absent, the job will not retry automatically. To add retries, update the definition through JIL (update_job: job_name with n_retrys: N). To rerun once right now, use sendevent -E FORCE_STARTJOB -J job_name.
Symptom · 02
Job status is FAILURE but box status is RUNNING
Fix
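Check whether the failing job has box_terminator: 0 (the default). If its failure should stop the box, set box_terminator: 1 and test. Also check whether the box defines its own box_success or box_failure conditions, which change how it derives its status.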
Symptom · 03
Job stuck in ACTIVATED status for hours
Fix
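Run autorep -J job_name -w to see job details. ACTIVATED means the box is running but this job's starting conditions are not yet met, so list them with job_depends -c -J job_name and look for the unmet one. If the job is actually RUNNING and hung, check term_run_time (without it, the job can run forever) and use sendevent -E KILLJOB -J job_name to force-stop it.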
Symptom · 04
Shadow Event Server never takes over after primary crash
Fix
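Check scheduler status with chk_auto_up, which reports on the primary, shadow and tie-breaker schedulers. Verify network connectivity between the machines and that the shadow and tie-breaker are defined in the instance configuration. Test failover quarterly.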
Symptom · 05
Alarm did not fire on job failure
Fix
Check alarm_if_fail:1 is set on the job. Verify notification rules in WCC or custom alarm scripts. Remember: if n_retrys > 0, the alarm only fires after all retries exhausted.
AutoSys Fault Tolerance: Commands & Fixes
The commands you need when jobs fail: check status, force retries, kill hung jobs, and verify HA.
Job failed — want to retry manually
Immediate action
Check job status and retry count.
Commands
autorep -J job_name -w
sendevent -E FORCE_STARTJOB -J job_name
Fix now
If you need time to fix the underlying issue first, hold the job with sendevent -E JOB_ON_HOLD -J job_name, then release it with sendevent -E JOB_OFF_HOLD -J job_name once the fix is in.
Job hung — never completes
Immediate action
Kill the job and check term_run_time.
Commands
sendevent -E KILLJOB -J job_name
autorep -J job_name -q | grep term_run_time
Fix now
Add term_run_time: 30 (the value is in minutes) to the job definition so the scheduler terminates the job if it runs longer than that.
Box not stopping on critical failure
Immediate action
Check if the failing job is a box_terminator.
Commands
autorep -J job_name -q | grep box_terminator
sendevent -E KILLJOB -J box_name
Fix now
Edit the job JIL to set box_terminator:1 and alarm_if_fail:1.
Event Server down, shadow not promoting
Immediate action
Check shadow status and connectivity.
Commands
autoflags -a | grep -E 'primary|shadow'
chk_auto_up -A -S SHADOW_INSTANCE
Fix now
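If the shadow is not syncing, restart the scheduler process on the shadow machine using your platform's service controls. sendevent cannot do this: it offers STOP_DEMON to stop a scheduler but has no event that starts one.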
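The grep check in the box_terminator card covers one job at a time. To audit a whole box, a small script can scan the JIL that autorep -J box_name -q prints. This is a minimal sketch, assuming the output follows the usual layout of one insert_job line per job followed by its attributes; the job names, paths, and the critical-job list are hypothetical.

```python
import re

# Hypothetical sample of what `autorep -J payroll_box -q` might print.
SAMPLE_JIL = """\
/* ----------------- payroll_box ----------------- */
insert_job: payroll_box   job_type: BOX

/* ----------------- validate_input ----------------- */
insert_job: validate_input   job_type: CMD
box_name: payroll_box
command: /opt/batch/validate.sh
box_terminator: 1

/* ----------------- load_config ----------------- */
insert_job: load_config   job_type: CMD
box_name: payroll_box
command: /opt/batch/load_config.sh
"""


def jobs_missing_box_terminator(jil_text, critical_jobs):
    """Return the critical jobs that do not have box_terminator enabled.

    A job that never appears in jil_text is also reported, because its
    setting cannot be confirmed.
    """
    enabled = {}
    current = None
    for raw in jil_text.splitlines():
        line = raw.strip()
        job = re.match(r"insert_job:\s*(\S+)", line)
        if job:
            current = job.group(1)
            enabled[current] = False
            continue
        flag = re.match(r"box_terminator:\s*(\S+)", line)
        if flag and current is not None:
            # JIL accepts 1/0 as well as y/n for this attribute.
            enabled[current] = flag.group(1).lower() in {"1", "y"}
    return [name for name in critical_jobs if not enabled.get(name, False)]
```

Calling jobs_missing_box_terminator(SAMPLE_JIL, ["validate_input", "load_config"]) reports load_config, which is the job to fix before the next run.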
Fault tolerance mechanisms in AutoSys
| Fault tolerance mechanism | What it handles | Configured where |
|---|---|---|
| n_retrys | Transient job failures (network blips) | Job definition attribute |
| box_terminator | Critical failure that should stop the whole box | Job definition attribute |
| term_run_time | Hung jobs that never complete | Job definition attribute |
| alarm_if_fail + notification | Human awareness and response | Job definition attributes |
| Dual Event Server (HA) | AutoSys server/infrastructure failure | AutoSys installation config |
| Remote Agent redundancy | Agent machine failure | Machine definitions + job failover logic |
| Recovery jobs | Post-failure state repair | Dedicated JIL definitions + manual trigger |
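The four job-level attributes in the table can sit together on a single definition. The sketch below renders such a definition as JIL text from a small Python helper, so the same template can be reused across validation jobs. The job, box, machine, and command names are hypothetical, and the values (2 retries, a 30-minute limit) are illustrative rather than recommendations from the vendor.

```python
# Template for a validation job that retries briefly, cannot hang forever,
# stops its box on failure, and raises an alarm once retries are used up.
JIL_TEMPLATE = """\
insert_job: {job}   job_type: CMD
box_name: {box}
machine: {machine}
command: {command}
n_retrys: 2
term_run_time: 30
box_terminator: 1
alarm_if_fail: 1
"""


def render_validation_job(job, box, machine, command):
    """Return JIL text for one validation job inside the given box."""
    return JIL_TEMPLATE.format(job=job, box=box, machine=machine, command=command)


if __name__ == "__main__":
    # Hypothetical names; the printed text can be reviewed and then fed to jil.
    print(render_validation_job(
        job="validate_input",
        box="payroll_box",
        machine="batch01",
        command="/opt/batch/validate.sh",
    ))
```

Keeping the attributes in one template makes it harder to ship a validation job that has retries but no box_terminator, which is the gap this article is about.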

Key takeaways

1. n_retrys handles transient failures automatically: set it on jobs prone to temporary external issues.
2. box_terminator:1 stops the entire box when a critical job fails. Use it on validation and prerequisite checks.
3. term_run_time prevents hung jobs from blocking everything downstream indefinitely.
4. alarm_if_fail only fires after all retries are exhausted: adjust the retry count or use custom alerting for time-sensitive jobs.
5. Infrastructure-level fault tolerance requires the dual Event Server HA setup: test failover regularly.
6. Recovery jobs must be idempotent and testable in isolation: document manual restart procedures clearly.

Common mistakes to avoid

6 patterns

Setting n_retrys too high (e.g., 10)

Symptom
When the underlying issue is permanent, all retries just delay the failure alarm by hours. The job keeps retrying long after the pipeline should have been halted for investigation.
Fix
Set n_retrys to 2 or 3 maximum. Use a separate health-check job to detect persistent issues early.
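The cost of a high retry count is easy to estimate. The sketch below assumes every attempt runs for about the same time before failing and that a retry starts after an optional fixed wait; both are simplifications, and the 20-minute runtime is a made-up figure.

```python
def worst_case_alarm_delay(n_retrys, run_minutes, wait_minutes=0):
    """Minutes from the first start until the alarm can fire.

    The job runs once plus n_retrys more times, and the alarm is raised
    only after the final attempt has failed.
    """
    attempts = n_retrys + 1
    return attempts * run_minutes + n_retrys * wait_minutes


# A 20-minute job with n_retrys: 10 alarms after 220 minutes,
# while the same job with n_retrys: 3 alarms after 80 minutes.
print(worst_case_alarm_delay(10, 20))
print(worst_case_alarm_delay(3, 20))
```

For a long-running job, the difference between 3 and 10 retries is most of a batch window, which is why the retry count should be sized against the escalation deadline rather than picked arbitrarily.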

Not using box_terminator on validation jobs

Symptom
Downstream jobs run with bad input and produce corrupt results. The box keeps RUNNING after the validation failure because nothing told it to stop.
Fix
Always set box_terminator:1 on data validation, prerequisite extraction, and configuration load jobs.

Treating n_retrys as a substitute for fixing flaky scripts

Symptom
The job retries successfully every night, but no one investigates the root cause. Over time, the problem worsens into a hard failure that brings down the whole pipeline.
Fix
Monitor retry rates using AutoSys reports or custom scripts. Any job that retries more than once a week should be investigated and fixed.
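A custom retry-rate check can be very small. The sketch below assumes you can export run history as (job name, run date, number of attempts) tuples from your own reporting; that record format is hypothetical and is not something AutoSys emits directly.

```python
from collections import Counter
from datetime import date, timedelta


def jobs_to_investigate(runs, today, window_days=7, max_retried_runs=1):
    """Return jobs that needed a retry on too many runs in the window.

    runs: iterable of (job_name, run_date, attempts) tuples.
    A run counts as retried when it took more than one attempt.
    """
    cutoff = today - timedelta(days=window_days)
    retried = Counter(
        job
        for job, run_date, attempts in runs
        if run_date > cutoff and attempts > 1
    )
    return sorted(job for job, count in retried.items() if count > max_retried_runs)


# Hypothetical history: extract_hr retried twice this week, load_gl once.
HISTORY = [
    ("extract_hr", date(2026, 3, 1), 2),   # outside the 7-day window
    ("extract_hr", date(2026, 3, 16), 2),
    ("extract_hr", date(2026, 3, 18), 3),
    ("load_gl", date(2026, 3, 17), 2),
    ("load_gl", date(2026, 3, 18), 1),
]
```

With this history and today set to 19 March 2026, only extract_hr crosses the once-a-week threshold.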

Not testing HA failover

Symptom
When the primary Event Server fails, the shadow does not promote properly, or jobs stop being scheduled. Many teams discover their shadow Event Server isn't actually in sync only when they need it.
Fix
Test failover quarterly in a non-production environment that mirrors production. Verify shadow status weekly with autoflags -a.

Ignoring idempotency when using retries

Symptom
On retry, the job inserts duplicate database rows, appends to log files without checking, or sends duplicate API calls. Data corruption propagates downstream.
Fix
Make all job scripts idempotent: use upsert logic, idempotency tokens, or checkpoints. Ensure that running the same job twice produces the same final state.
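Upsert logic is the simplest of these to show. The sketch below uses SQLite from the Python standard library as a stand-in for the real target database; the payments table and its columns are hypothetical. Loading the same batch twice leaves the table in the same state as loading it once.

```python
import sqlite3


def load_payments(conn, rows):
    """Insert or update payment rows keyed by payment_id.

    Safe to call again with the same rows after a retry: existing keys
    are updated in place instead of being inserted a second time.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "payment_id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
    )
    with conn:  # commits on success, rolls back if the load raises
        conn.executemany(
            "INSERT INTO payments (payment_id, amount_cents) VALUES (?, ?) "
            "ON CONFLICT(payment_id) DO UPDATE "
            "SET amount_cents = excluded.amount_cents",
            rows,
        )


def row_count(conn):
    """Return the number of rows currently in the payments table."""
    return conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

A plain INSERT in the same script would fail or duplicate rows on the second attempt, depending on the schema, which is exactly the retry hazard described above.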

Over-using alarm_if_fail on every job

Symptom
The on-call engineer receives dozens of pages for transient failures that auto-recover. Desensitisation leads to ignored alarms and missed critical failures.
Fix
Only set alarm_if_fail:1 on jobs where a failure requires human intervention. For retryable jobs, the single alarm raised once retries are exhausted is sufficient. Use different severity levels for different job classes.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · SENIOR
How does n_retrys work in AutoSys and what are its limitations?
Q02 · SENIOR
What is box_terminator and when would you use it?
Q03 · SENIOR
What is the difference between fault tolerance at the job level and at t...
Q04 · SENIOR
If a validation job fails, how do you ensure none of the downstream jobs...
Q05 · SENIOR
How do you verify that AutoSys HA is working correctly?
Q06 · SENIOR
How would you design a recovery plan for a critical batch box that faile...
Q01 of 06 · SENIOR

How does n_retrys work in AutoSys and what are its limitations?

ANSWER
n_retrys specifies the number of automatic retries after a job failure. Each retry is a full re-execution of the job script. The alarm (if alarm_if_fail:1) fires only after all retries are exhausted. Limitations: it is not suitable for jobs that are not idempotent because retries can cause duplicate side effects. It also masks persistent issues — if a job retries successfully each time, the root cause is never investigated. The retry delay can be problematic for time-sensitive jobs.
FAQ · 7 QUESTIONS

Frequently Asked Questions

01. How does n_retrys work in AutoSys?
02. What is box_terminator in AutoSys?
03. How do I prevent downstream jobs from running after a failure?
04. How do I test AutoSys HA failover?
05. Should I set n_retrys on every job?
06. What is a recovery job and when should I use one?
07. How do I handle a job that is stuck in STARTING status?
COMPLETE GUIDE
The Complete AutoSys Workload Automation Guide for Engineers →

JIL syntax, sendevent, autorep, box jobs, file watchers, scheduling, HA, security, cloud workload automation, and 22 interview questions — the definitive AutoSys reference for production engineers.
