AutoSys Fault Tolerance and Recovery — Building Resilient Batch Workflows

Learn AutoSys fault tolerance patterns: n_retrys, box_terminator, HA setup, restart procedures, and recovery strategies for failed batch workflows in production environments.
🔥 Advanced — solid DevOps foundation required
In this tutorial, you'll learn:
  • n_retrys handles transient failures automatically — set it on jobs prone to temporary external issues
  • box_terminator: 1 stops the entire box when a critical job fails — use it on validation and pre-requisite checks
  • term_run_time prevents hung jobs from blocking everything downstream indefinitely
⚡ Quick Answer
Fault tolerance in AutoSys is like building redundancy into your plans. If the main road is blocked (job fails), you want automatic detours (retries), emergency alerts (alarms), and a backup plan (recovery jobs). Good fault tolerance means problems get handled automatically at 3 AM without waking anyone up.

Enterprise batch workflows run overnight when no one is watching. The jobs that matter most — payroll, settlement, reconciliation — are the ones where failures are most costly. Building fault tolerance into your AutoSys design means many failures recover automatically, and when they don't, the right people are notified with enough context to fix things quickly.

Automatic retry with n_retrys

n_retrys is the simplest fault tolerance mechanism. It tells AutoSys to automatically rerun a failed job up to N times before declaring a final FAILURE. This handles transient failures such as brief network blips or temporary database connection issues.

retry_config.jil · JIL
insert_job: extract_market_data
job_type: CMD
command: /scripts/extract_market.sh
machine: data-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "18:00"
n_retrys: 3            /* retry up to 3 times after initial failure = 4 total attempts */
alarm_if_fail: 1       /* alarm only after all retries exhausted */
term_run_time: 45      /* kill if running over 45 minutes */
std_err_file: /logs/autosys/extract_market_data.err
▶ Output
/* Execution sequence on failure:
18:00:01 — Attempt 1: FAILURE (exit code 1)
18:00:31 — Retry 1: FAILURE (exit code 1)
18:01:01 — Retry 2: SUCCESS (exit code 0)
18:01:01 — extract_market_data: SUCCESS — downstream jobs proceed */
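
To make the attempt count concrete, here is a small local emulation of the retry semantics described above. It is only a sketch: it is not how AutoSys implements n_retrys internally, it ignores the delay between restarts, and run_with_retries and flaky are hypothetical helpers invented for this illustration.

```shell
# Local emulation of n_retrys semantics: 1 original attempt + up to N retries.
# run_with_retries is a hypothetical helper, not an AutoSys command.
run_with_retries() {
  local n_retrys=$1
  shift
  local attempt=1
  local max_attempts=$(( n_retrys + 1 ))
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "attempt ${attempt}: SUCCESS"
      return 0
    fi
    echo "attempt ${attempt}: FAILURE"
    attempt=$(( attempt + 1 ))
  done
  echo "final status: FAILURE after ${max_attempts} attempts"
  return 1
}

# Hypothetical flaky command: fails on its first two calls, then succeeds.
# The call count lives in a temp file so it survives subshells.
counter_file=$(mktemp)
echo 0 > "$counter_file"
flaky() {
  local n
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  [ "$n" -ge 3 ]
}

run_with_retries 3 flaky
```

With n_retrys set to 3, the emulation allows four attempts in total. The flaky command succeeds on the third, which mirrors the sequence in the output above: one initial failure, one failed retry, then success.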

box_terminator — stopping the box on critical failure

In a box containing multiple independent jobs, a failure in one job normally does not stop the others from running. If one job's failure should stop everything, mark that job with box_terminator: 1.

box_terminator.jil · JIL
insert_job: validate_input_data
job_type: CMD
box_name: eod_box
command: /scripts/validate.sh
machine: server01
owner: batch
box_terminator: 1          /* if this fails, the entire box fails immediately */
alarm_if_fail: 1

/* Without box_terminator: other jobs in the box would continue even after validate fails */
/* With box_terminator: box immediately moves to FAILURE, all pending inner jobs skip */
Tip: Make validation jobs box_terminators
Data validation jobs are ideal box_terminator candidates. If input data is invalid, there's no point running any of the downstream processing jobs, because they would produce bad output. Mark the validation job as box_terminator: 1 to stop the entire box immediately on validation failure.
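
For a validation job to act as a box_terminator, its script has to exit non-zero when the input is bad, since AutoSys by default treats only exit code 0 as success. The sketch below shows one way a script like /scripts/validate.sh might do that. The validate_input function, the CSV layout, and the expected header are all assumptions made for illustration, not part of the original job.

```shell
# validate_input is a hypothetical check; file layout and header are assumptions.
# Returns 0 when the file looks usable, 1 otherwise.
validate_input() {
  local file=$1
  local expected_header=$2
  # Reject a missing or zero-byte file
  if [ ! -s "$file" ]; then
    echo "validation failed: $file is missing or empty" >&2
    return 1
  fi
  # Reject a file whose first line is not the expected header
  if [ "$(head -n 1 "$file")" != "$expected_header" ]; then
    echo "validation failed: unexpected header in $file" >&2
    return 1
  fi
  # Reject a file that has a header but no data rows
  if [ "$(awk 'END { print NR }' "$file")" -lt 2 ]; then
    echo "validation failed: $file has no data rows" >&2
    return 1
  fi
  return 0
}

# Usage inside a job script (path and header are illustrative):
# validate_input /data/eod/input.csv "trade_id,amount,currency" || exit 1
```

Writing the reason to stderr means it lands in the job's std_err_file if one is configured, which gives whoever responds to the alarm some context.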

HA architecture for fault tolerance

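At the infrastructure level, AutoSys supports high availability through dual Event Servers (two synchronized databases), typically combined with a shadow scheduler that takes over if the primary scheduler fails. For mission-critical batch environments, this is non-negotiable.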

ha_check.sh · BASH
# Check which Event Server is currently primary
autoflags -a | grep -i 'primary\|shadow\|active'

# Verify shadow is in sync
autoflags -a | grep -i 'shadow\|standby'

# Check Event Processor status (should be RUNNING on primary)
chk_auto_up -A

# In a dual-server setup, this also shows shadow status
# chk_auto_up -A -S SHADOW_INSTANCE
▶ Output
AutoSys Instance: ACE
Event Server Role: PRIMARY (active)
Shadow Status: IN_SYNC
Event Processor: RUNNING
| Fault tolerance mechanism | What it handles | Configured where |
|---|---|---|
| n_retrys | Transient job failures (network blips) | Job definition attribute |
| box_terminator | Critical failure that should stop the whole box | Job definition attribute |
| term_run_time | Hung jobs that never complete | Job definition attribute |
| Dual Event Server (HA) | AutoSys server/infrastructure failure | AutoSys installation config |
| Remote Agent redundancy | Agent machine failure | Machine definitions + job failover logic |
| alarm_if_fail + notification | Human awareness and response | Job definition attributes |

🎯 Key Takeaways

  • n_retrys handles transient failures automatically — set it on jobs prone to temporary external issues
  • box_terminator: 1 stops the entire box when a critical job fails — use it on validation and pre-requisite checks
  • term_run_time prevents hung jobs from blocking everything downstream indefinitely
  • Infrastructure-level fault tolerance requires the dual Event Server HA setup — test failover regularly

⚠ Common Mistakes to Avoid

  • Setting n_retrys too high (e.g., 10) — if the underlying issue is permanent, all retries just delay the failure alarm by hours
  • Not using box_terminator on validation jobs — downstream jobs run with bad input and produce corrupt results
  • Treating n_retrys as a substitute for fixing flaky scripts — retries mask problems; fix the root cause
  • Not testing HA failover — many teams discover their shadow Event Server isn't actually in sync only when they need it
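
The cost of a high n_retrys value is easy to estimate. As a rough sketch, assume every attempt runs for the same number of minutes before exiting with a failure, and ignore the gap between restarts. The alarm is then delayed by roughly (n_retrys + 1) × minutes per attempt. The helper name and the 30-minute figure below are hypothetical.

```shell
# worst_case_alarm_delay is a hypothetical back-of-the-envelope helper.
# Args: n_retrys, minutes each failing attempt runs before it exits.
# Prints the approximate minutes until the final FAILURE alarm.
worst_case_alarm_delay() {
  local n_retrys=$1
  local minutes_per_attempt=$2
  echo $(( (n_retrys + 1) * minutes_per_attempt ))
}

worst_case_alarm_delay 3 30    # prints 120
worst_case_alarm_delay 10 30   # prints 330
```

Under these assumptions, a job that takes 30 minutes to fail raises its alarm after about 2 hours with n_retrys: 3, but only after about 5.5 hours with n_retrys: 10.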

Interview Questions on This Topic

  • Q: How does n_retrys work in AutoSys?
  • Q: What is box_terminator and when would you use it?
  • Q: What is the difference between fault tolerance at the job level and at the infrastructure level in AutoSys?
  • Q: If a validation job fails, how do you ensure none of the downstream jobs in the box run?
  • Q: How do you verify that AutoSys HA is working correctly?

Frequently Asked Questions

How does n_retrys work in AutoSys?

n_retrys specifies how many automatic retries AutoSys performs after a job fails. With n_retrys: 3, the job runs up to 4 times total: the original attempt plus 3 retries. The alarm only fires (if alarm_if_fail: 1) after all retries are exhausted.

What is box_terminator in AutoSys?

box_terminator: 1 marks a job as the kill switch for its parent box. If this job fails, AutoSys immediately terminates the box and all remaining pending inner jobs. It's ideal for validation or prerequisite jobs whose failure makes all downstream processing meaningless.

How do I prevent downstream jobs from running after a failure?

Use condition: success(upstream_job) on downstream jobs, and/or use box_terminator: 1 on the critical upstream job. With success() conditions, downstream jobs only start when the upstream succeeds. With box_terminator, the entire box stops on failure.
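
As a sketch of the success() approach, the snippet below writes a downstream job definition that waits for validate_input_data, then loads it with the jil client if one is installed. The job name load_eod_data and the script /scripts/load_eod.sh are hypothetical; the box, machine, and owner values are borrowed from the box_terminator example.

```shell
# Write a hypothetical downstream job definition to a temporary JIL file.
# load_eod_data and /scripts/load_eod.sh are illustrative assumptions.
jil_file=$(mktemp)
cat > "$jil_file" <<'EOF'
insert_job: load_eod_data
job_type: CMD
box_name: eod_box
command: /scripts/load_eod.sh
machine: server01
owner: batch
condition: success(validate_input_data)
alarm_if_fail: 1
EOF

# Load the definition only where the AutoSys jil client is available;
# otherwise just print it for review.
if command -v jil >/dev/null 2>&1; then
  jil < "$jil_file"
else
  cat "$jil_file"
fi
```

Combining both mechanisms is a reasonable default. The condition keeps load_eod_data from starting until validation succeeds, and box_terminator: 1 on validate_input_data stops the rest of the box when validation fails.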

How do I test AutoSys HA failover?

In a test environment, stop the primary Event Server and verify the shadow promotes automatically within the expected time. Check with autoflags -a that the shadow is now the primary, and verify that jobs continue to be scheduled correctly. Document the failover procedure and test it annually in production-equivalent environments.

Should I set n_retrys on every job?

Not necessarily. n_retrys is best for jobs that interface with external systems prone to transient failures (network services, external APIs, databases under load). For jobs with deterministic inputs and outputs, a single failure usually warrants human investigation rather than automatic retry.

Naren, Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
