AutoSys box_terminator — Prevent Silent Validation Fails
A failed validation job let downstream jobs run on corrupt data — $2.4M payroll error.
20+ years shipping production infrastructure and CI/CD at scale. Lessons pulled from things that broke in production.
- n_retrys: auto-retry a failed job up to N times for transient failures
- box_terminator: stop the entire box when a critical job fails
- Dual Event Server HA: infrastructure-level failover for the AutoSys scheduler
- alarm_if_fail: notify only after all retries exhausted — don't wake ops for blips
- term_run_time: kill hung jobs so downstream isn't blocked indefinitely
- Biggest mistake: setting n_retrys too high masks permanent failures and delays escalation by hours
Fault tolerance in AutoSys is like building redundancy into your plans. If the main road is blocked (job fails), you want automatic detours (retries), emergency alerts (alarms), and a backup plan (recovery jobs). Good fault tolerance means problems get handled automatically at 3 AM without waking anyone up.
AutoSys fault tolerance recovery is the safety net that catches batch jobs when your primary strategy fails. Without it, a single failed dependency can silently corrupt a month-end report or leave a critical data feed in an inconsistent state. Developers need it because AutoSys jobs don't restart themselves, and your monitoring dashboard won't tell you when a job finished but produced garbage—only explicit recovery patterns like box_terminator, n_retrys, and alarm_if_fail prevent those silent validation failures from becoming production fires.
Why AutoSys box_terminator Exists — Stop Silent Validation Fails
AutoSys box_terminator is a job attribute, set on a job inside a box, that forces the parent box to abort as soon as that job fails, rather than letting the box carry on with its remaining jobs. Without it, depending on how the box's success and failure conditions are defined, the box can keep running after a critical step has failed, and downstream work can be triggered on data that was never validated. This is the core mechanic: box_terminator: 1 on a critical job makes the box fail on that job's failure, preventing silent data corruption or downstream job triggers based on a false positive.
In practice, box_terminator works by failing the parent box as soon as the marked job exits with a failure status, and then terminating the remaining running children (SIGTERM, then SIGKILL after a grace period). This is not a soft stop; it is a hard abort. Key properties: it takes 1 or 0 (default 0), it is set on jobs inside a box rather than on the box itself, and it does not wait for sibling jobs to finish gracefully. The termination order is immediate: the box fails, then children are killed. This means you lose any cleanup logic in the jobs that get killed, so design for that.
Use box_terminator: 1 on the critical jobs of any box where a single failure invalidates the entire batch, for example an ETL pipeline where a failed extract step means the load step would process stale or partial data. It matters because without it you get 'silent validation fails': the box keeps going, downstream jobs trigger on work that looks successful, and data quality issues surface hours later in production reports. Real systems use this to enforce transactional boundaries across job steps.
Automatic retry with n_retrys
The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.
Here's the thing: each retry is a full new attempt — the job script runs again from scratch. AutoSys doesn't resume from where it left off. So if your job is not idempotent, retries can cause data duplication or corruption. For example, an INSERT without a uniqueness check will happily create duplicate rows on each retry. Make sure your scripts handle re-entry safely: use idempotency keys, checkpoints, or database MERGE (upsert) logic.
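The re-entry problem above can be sketched in a few lines. This is a minimal illustration, assuming a SQLite table named payments and an idempotency_key column (both hypothetical, not from any real pipeline); the point is only that a retried load leaves the row count unchanged.

```python
import sqlite3


def load_rows(conn: sqlite3.Connection, rows: list) -> None:
    """Insert rows keyed by an idempotency key; a re-run skips rows already loaded."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        " idempotency_key TEXT PRIMARY KEY,"
        " amount REAL NOT NULL)"
    )
    # INSERT OR IGNORE leaves existing keys untouched, so a retry cannot duplicate rows.
    conn.executemany(
        "INSERT OR IGNORE INTO payments (idempotency_key, amount) VALUES (?, ?)",
        rows,
    )
    conn.commit()


def row_count(conn: sqlite3.Connection) -> int:
    """Number of rows currently in the payments table."""
    return conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

Calling load_rows twice with the same batch simulates an n_retrys rerun: the second call is a no-op for keys that already exist. A plain INSERT without the key would have doubled the rows.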
When setting n_retrys, choose a number that matches the expected transient window. If network blips last ~30 seconds and your job runs in 2 minutes, n_retrys: 3 gives about 6 minutes of recovery time (three retries at roughly 2 minutes each). That's enough for most intermittent issues without delaying the pipeline too much.
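The arithmetic behind that estimate, as a small sketch. It assumes every retry runs for the job's full runtime and that retries start back to back with no delay between them; both function names are illustrative.

```python
def retry_window_minutes(n_retrys: int, job_runtime_minutes: float) -> float:
    """Time spent in retry attempts alone, assuming each retry runs the full job."""
    return n_retrys * job_runtime_minutes


def worst_case_minutes(n_retrys: int, job_runtime_minutes: float) -> float:
    """Initial attempt plus every retry: the time until a final FAILURE is declared."""
    return (n_retrys + 1) * job_runtime_minutes
```

For a 2-minute job with n_retrys: 3, the retry window is 6 minutes and the worst case before a final FAILURE is 8 minutes. The second number is the one to compare against your escalation target.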
box_terminator — stopping the box on critical failure
In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. If one job's failure should stop everything — because its output is required or its failure invalidates all subsequent work — mark it as a box_terminator.
When a job with box_terminator:1 fails, AutoSys immediately transitions the parent box to FAILURE. All currently pending inner jobs are skipped (their status becomes TERMINATED). Any jobs already running are killed. This prevents wasted compute on bad data and reduces the time to detect and recover.
- Data validation jobs (schema checks, referential integrity)
- Prerequisite extraction jobs (if upstream source is unavailable)
- Configuration or lookup table loads (everything depends on them)
Do not use box_terminator on jobs that have graceful degradation paths. If a downstream job can handle missing data (e.g., produce a partial report with a warning), let it run.
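For teams that generate JIL from code, a validation gate might be rendered roughly like this. It is a sketch only: the job, box, machine, and script names are hypothetical, and render_jil is an illustrative helper, not part of AutoSys. The attribute names (box_name, box_terminator, n_retrys, alarm_if_fail) are the ones discussed in this article.

```python
def render_jil(job_name: str, attributes: dict) -> str:
    """Return JIL text: an insert_job line followed by 'key: value' attribute lines."""
    lines = [f"insert_job: {job_name}"]
    lines += [f"{key}: {value}" for key, value in attributes.items()]
    return "\n".join(lines)


validation_job = render_jil(
    "payroll_validate",  # hypothetical job name
    {
        "job_type": "CMD",
        "box_name": "payroll_box",            # hypothetical parent box
        "command": "/opt/batch/validate.sh",  # hypothetical script path
        "machine": "batch-host-01",           # hypothetical agent machine
        "box_terminator": 1,  # a failure here terminates the parent box
        "n_retrys": 0,        # a validation failure is not transient, so no retries
        "alarm_if_fail": 1,   # raise an alarm when the job fails
    },
)
```

Printing validation_job gives text that could be reviewed and then loaded with the jil command. Keeping n_retrys at 0 on the gate means a bad data set fails the box on the first attempt instead of being re-validated several times.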
alarm_if_fail and notification — when to wake someone up
alarm_if_fail:1 tells AutoSys to trigger an alarm when a job fails. But the timing matters: if you also have n_retrys > 0, the alarm only fires after all retries are exhausted. That's the right behaviour for transient failures — you don't want the on-call engineer paged for a 30-second network glitch.
However, some jobs should always alarm on the first failure, regardless of retries. For those, consider splitting the job: put the retry logic in a separate pre-step, and give the main job alarm_if_fail:1 and n_retrys:0. Alternatively, use a different notification mechanism, such as a custom script that sends a page on any non-zero exit code.
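A custom page-on-first-failure wrapper could look something like the sketch below. The paging hook is a placeholder that writes to stderr; a real one would call whatever paging or email integration you use. Both function names are hypothetical.

```python
import subprocess
import sys
from typing import Callable, Sequence


def run_and_page(command: Sequence[str], page: Callable[[str], None]) -> int:
    """Run the job command; call the paging hook on any non-zero exit code."""
    result = subprocess.run(list(command))
    if result.returncode != 0:
        page(f"job command {list(command)!r} exited with {result.returncode}")
    # Return the original exit code so the caller can pass it on to the scheduler.
    return result.returncode


def stderr_pager(message: str) -> None:
    """Placeholder hook: replace with a call to your paging system."""
    print(f"PAGE: {message}", file=sys.stderr)
```

Used as the job's command, the wrapper script would end with sys.exit(run_and_page(sys.argv[1:], stderr_pager)), so AutoSys still sees the real exit code and its own retry and alarm handling is unaffected.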
In AutoSys, the alarm mechanism is typically configured in WCC or via an external event handler. The job attribute alarm_if_fail sets a flag that AutoSys propagates to the event server. Make sure your notification system (email, SMS, PagerDuty) is subscribed to these events. Many teams set up automated alerting rules that trigger on job status FAILURE, but if those rules don't respect the retry state, they may fire on every transient blip.
HA architecture for fault tolerance
At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.
The setup involves two AutoSys instances: a primary and a shadow (standby) Event Server. They share a common file system (NFS) where the AutoSys database and binaries are stored. The shadow Event Server monitors the primary via a heartbeat. If the primary becomes unreachable, the shadow promotes itself to active within a configurable timeout (default is typically 5 minutes).
Important: the shadow is not an active-active cluster. Only one Event Processor runs jobs at a time. The shadow is a passive standby: it stays running and ready to take over, but does not process jobs while the primary is healthy.
Failover is automatic, but not instantaneous. During the promotion period, no jobs are scheduled, no events are processed. If the failover happens during a critical window, that gap can cause SLAs to be missed. Consider scheduling maintenance windows around failover testing.
- Primary Event Server actively schedules and runs all jobs.
- Shadow Event Server watches the primary's heartbeat; it does not schedule jobs.
- On failure, the shadow promotes itself, reads the shared database, and starts processing.
- The shared file system must be highly available itself — if NFS goes down, failover fails.
- Failover takes time (typically 1–5 minutes) — jobs scheduled in that window are delayed.
Recovery jobs and manual intervention patterns
Even with automatic retries and box_terminators, some failures require human intervention. Recovery jobs are specially designed jobs that repair the state after a failure and allow the pipeline to resume from a clean point.
- Rollback jobs: Reverse the effects of a partially completed batch (e.g., delete inserted rows, restore files from backup).
- Re-run jobs: A job that reinitialises the pipeline after a failure — often a wrapper that truncates and re-imports data.
- Compensation jobs: Run after a failure to fix data integrity issues before the next cycle.
- Manual restart procedures: Documented steps to use sendevent to reset job statuses and re-trigger the box.
When designing recovery, think about idempotency: the recovery job should be safe to run multiple times if the first attempt also fails. Use checkpoints in your scripts: record completion steps in a control table so that rerunning the recovery job doesn't repeat already-completed actions.
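The control-table idea can be sketched as follows, assuming a SQLite table named recovery_control and a caller-supplied run_id (both hypothetical). A step is recorded only after it succeeds, so a rerun picks up at the first step that did not complete.

```python
import sqlite3
from typing import Callable, List, Tuple


def run_with_checkpoints(
    conn: sqlite3.Connection,
    run_id: str,
    steps: List[Tuple[str, Callable[[], None]]],
) -> List[str]:
    """Run each step once per run_id; steps already in the control table are skipped."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recovery_control ("
        " run_id TEXT, step TEXT, PRIMARY KEY (run_id, step))"
    )
    executed = []
    for name, action in steps:
        done = conn.execute(
            "SELECT 1 FROM recovery_control WHERE run_id = ? AND step = ?",
            (run_id, name),
        ).fetchone()
        if done:
            continue  # completed on an earlier attempt
        action()  # an exception here stops the run and leaves no checkpoint
        conn.execute(
            "INSERT INTO recovery_control (run_id, step) VALUES (?, ?)",
            (run_id, name),
        )
        conn.commit()
        executed.append(name)
    return executed
```

The function returns the names of the steps it actually ran, which is useful to log. Note the gap between a step finishing and its checkpoint being written: if the process dies in between, that step runs again, so each step still needs to be safe to repeat.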
A good practice is to separate recovery jobs into their own box with no dependencies on the business-critical timeline. Keep them available for ops to trigger via a JIL override or sendevent.
Replication Strategies That Don't Lie — Full vs Partial vs Shadow
Competitors will tell you replication is about copying data. That's like saying a parachute is about fabric. You need to know which failure you're surviving before you pick a strategy.
Full replication means every node carries the entire job history and state. It's expensive. It's slow. But when a primary AutoSys agent drops dead, failover is instant. No context loss. The tradeoff is network chatter and storage bloat. Don't use this for ephemeral jobs.
Partial replication is smarter. You replicate job definitions and critical state (like last run timestamp, exit code) to a secondary agent. Active jobs get mirrored in real-time; idle jobs just carry a pointer. This cuts overhead by 60-70% in most shops.
Shadowing (passive replication) keeps a standby agent that receives checkpoint data but never processes. It's a warm spare. When the primary fails, shadowing restores from the last checkpoint — you lose any in-flight work. Fine for batch windows with no mid-job dependencies. Bad for real-time pipelines.
Active replication runs two agents processing the same job stream. Both acknowledge completion. If one drops, the other holds the state. Double the resource burn, zero recovery time. Use it only for jobs where a second of downtime costs a thousand dollars.
Pick your poison based on your recovery time objective, not your budget's comfort.
Fault Detection and Recovery — What Your Monitoring Dashboard Won't Tell You
Detection isn't an alert. It's a protocol. Most teams set a ping alarm on the AutoSys agent and call it done. That catches a dead box but misses the silent killer: the agent that's alive but stuck in a zombie job loop.
Your detection layer needs three signals: heartbeat from the agent, job execution lag, and agent CPU/memory creep. If the agent's heart beats but it hasn't started a scheduled job in 5 minutes, that's a fault. If it's consuming 90% RAM but completing jobs, that's a degradation — not a failure yet, but you're on the clock.
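The three-signal rule can be written down as a small classifier. This is a sketch under assumptions: the field names are hypothetical, the signals are collected by your own monitoring, and the thresholds are the 5 minutes and 90% quoted above.

```python
from dataclasses import dataclass


@dataclass
class AgentSignals:
    heartbeat_ok: bool             # did the agent answer its last heartbeat?
    seconds_overdue: float         # how long the oldest due job has waited to start
    memory_percent: float          # agent memory use, 0 to 100


def classify(
    signals: AgentSignals,
    max_overdue_seconds: float = 300.0,
    memory_limit_percent: float = 90.0,
) -> str:
    """Map the three signals to 'fault', 'degraded', or 'healthy'."""
    if not signals.heartbeat_ok:
        return "fault"
    if signals.seconds_overdue > max_overdue_seconds:
        return "fault"  # alive, but not starting scheduled work
    if signals.memory_percent >= memory_limit_percent:
        return "degraded"  # still completing jobs, but on the clock
    return "healthy"
```

The order of the checks matters: a stuck agent with high memory is reported as a fault, not as a degradation, so the more urgent state always wins.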
Recovery starts the second you detect. Don't wait for human approval. Automate: kill the stuck PID, restart the agent, re-run the orphaned jobs. Use alarm_if_fail to flag only the failures that survive three retries. Everything else is noise.
For recovery jobs, the pattern is simple: a dedicated recovery box that triggers when a box_terminator job has failed its box or when a heartbeat is missed. That recovery box calls a shell script to restart the agent and re-queue critical jobs. Log every action. You need the trail when the incident post-mortem asks "who did what."
Manual intervention is the last resort — reserve it for scenarios where automated recovery would corrupt data: incomplete file transfers, partially written database loads. For everything else, automate until it hurts.
Overview
AutoSys fault tolerance is not merely about configuring retries or alarms; it's about designing a system that degrades gracefully under failure. This guide dissects the patterns that prevent silent validation failures and ensure recoverability without human babysitting. You'll learn why a box_terminator job exists to stop cascading errors, how n_retrys protects against transient faults while avoiding infinite loops, and when alarm_if_fail should wake an operator versus signal a normal retry. We cover HA architectures that replicate job definitions across schedulers, recovery jobs that replay failed workflows with idempotency guards, and replication strategies (full, partial, shadow) that trade off consistency for uptime. The goal: move from reactive firefighting to proactive fault isolation. Each topic follows a "WHY before HOW" approach, starting with the failure mode you're trying to avoid, then showing the JIL configuration that solves it, without assuming your monitoring dashboard will tell you the whole truth.
Without box_terminator, a single failing job can keep retrying forever while downstream jobs silently wait, creating an invisible deadlock that no dashboard will flag.
RestController and External API Caller
Your AutoSys jobs often trigger external REST APIs, but a slow or failing API can cause job hangs, retry storms, and stale locks. A dedicated RestController wrapper (in Java or Python) separates the HTTP call logic from the job script, enforcing timeouts, response validation, and structured error codes. For example, a job that calls a payment gateway should use a RestController with a 10-second timeout and a 503 fallback. The job's exit_code maps directly to the RestController's response: 0 for success, 1 for timeout, 2 for invalid response. This turns ambiguous network failures into actionable job statuses. External API callers must also implement idempotency keys to prevent duplicate charges on retry. Combine this with AutoSys n_retrys set to 2 (not infinite) so a flaky API doesn't cascade. The RestController becomes the single fault boundary: if the API is down, your job fails fast instead of hanging until the scheduler kill time.
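A minimal Python sketch of such a wrapper, using only the standard library. The exit codes follow the mapping above (0 success, 1 timeout, 2 invalid response). Two things here are assumptions: the response shape (a JSON object with "status": "ok") is invented for illustration, and non-timeout network errors are folded into code 2.

```python
import json
import socket
import urllib.error
import urllib.request
from typing import Callable

EXIT_OK, EXIT_TIMEOUT, EXIT_INVALID = 0, 1, 2
TIMEOUT_ERRORS = (TimeoutError, socket.timeout)


def http_get(url: str, timeout: float) -> bytes:
    """Default fetcher: plain GET with a hard timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def call_api(
    url: str,
    fetch: Callable[[str, float], bytes] = http_get,
    timeout: float = 10.0,
) -> int:
    """Call the API once and translate the outcome into a job exit code."""
    try:
        body = fetch(url, timeout)
    except TIMEOUT_ERRORS:
        return EXIT_TIMEOUT
    except OSError as exc:
        # urllib wraps some timeouts in URLError; unwrap via its 'reason' attribute.
        if isinstance(getattr(exc, "reason", None), TIMEOUT_ERRORS):
            return EXIT_TIMEOUT
        return EXIT_INVALID  # assumption: other network and HTTP errors map to 2
    try:
        payload = json.loads(body)
    except ValueError:
        return EXIT_INVALID
    # Hypothetical response contract: a JSON object with "status": "ok".
    if not isinstance(payload, dict) or payload.get("status") != "ok":
        return EXIT_INVALID
    return EXIT_OK
```

The job script would finish with sys.exit(call_api(url)) so the code reaches AutoSys unchanged. Passing the fetcher in as a parameter keeps the error mapping testable without a network, and the wrapper deliberately makes a single attempt, leaving retries to n_retrys.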
The Night the Payroll Box Ran All Weekend
- Add a success() condition on the first processing job after validation, so that even if box_terminator is accidentally removed, the condition blocks execution.
- Always mark validation and gate jobs as box_terminators: a failure there means all downstream work is garbage.
- Alarm on validation failures at severity CRITICAL — not just on jobs that crash but on jobs that invalidate the data pipeline.
- Never assume a box failure cascade works without explicit attributes — test the failure scenario in a non-prod environment.
Run autorep -J job_name -q and verify that n_retrys is set. If it is 0, the job will not retry automatically. To add retries, put the job on hold with sendevent -E JOB_ON_HOLD -J job_name, then update the JIL with n_retrys.
Check whether the job has box_terminator:0 (the default). If needed, set box_terminator:1 and test. Also check if the box has box_terminator overridden at box level.
Run autorep -J job_name -w to see job details. Check term_run_time: without it, the job will wait forever. Use sendevent -E KILLJOB -J job_name to force-stop.
Run autoflags -a | grep shadow. Verify network connectivity and that SHADOW_INSTANCE is configured in autosys.conf. Test failover quarterly.
Check that alarm_if_fail:1 is set on the job. Verify notification rules in WCC or custom alarm scripts. Remember: if n_retrys > 0, the alarm only fires after all retries are exhausted.
autorep -J job_name -w
sendevent -E FORCE_STARTJOB -J job_name
sendevent -E CHANGE_STATUS -s ACTIVATED -J job_name after setting ON_HOLD.
Common mistakes to avoid
- Setting n_retrys too high (e.g., 10)
- Not using box_terminator on validation jobs
- Treating n_retrys as a substitute for fixing flaky scripts
- Not testing HA failover
- Ignoring idempotency when using retries
- Over-using alarm_if_fail on every job
Interview Questions on This Topic
How does n_retrys work in AutoSys and what are its limitations?