AutoSys Fault Tolerance: 5 Recovery Patterns That Work

📅 March 19, 2026 ⏱ 3 min read 🎯 Beginner

In this tutorial, you'll learn

n_retrys handles transient failures — but monitor retry rate. >1% retry means fix the root cause.
box_terminator: 1 on validation jobs stops bad data propagation immediately.
term_run_time prevents infinite hangs. Every external-facing job needs it.

⚡Quick Answer

n_retrys: AutoSys retries failed jobs N times. Handles network blips. Default alarm fires only after all retries exhaust.
box_terminator: Stops the entire box when a critical job fails. Use on validation jobs — bad input shouldn't propagate.
term_run_time: Hard kill after N minutes. Prevents hung jobs from blocking downstream workflows forever.
Dual Event Server HA: Automatic failover takes 60-90 seconds. Running jobs continue; new jobs wait.
The 3 AM lesson: Retries without root-cause fixes mask problems. Permanent failures still need humans.

Assumption: The team assumed the job was stable because it always ended in SUCCESS. They didn't know it was failing 4 times before succeeding. The retries masked a race condition in the extract script.

Root cause: n_retrys: 5 with alarm_if_fail: 1. The alarm fires only after all retries are exhausted, and because retry 4 succeeded, it never triggered. The job's final SUCCESS hid the failures from monitoring. The script had a race condition: it deleted a temp table before committing the final write. On a quiet system, the timing worked. Under load, the table disappeared mid-read, causing an error. Retry 4 always worked because by then the load had subsided.

Fix:
- Changed n_retrys from 5 to 2. Excessive retries mask real problems.
- Added monitoring on job attempt count, not just final status.
- Fixed the actual bug: commit-before-drop ordering in the script.
- Added alert for 'any failure' on critical jobs, regardless of retry exhaustion.

Key Lesson

- n_retrys masks failures. Monitor attempt counts, not just final status.
- If a job regularly needs retries, you have a root cause. Fix it, don't retry it.
- alarm_if_fail fires only after all retries. Use a separate alert for the first failure.
- Five retries are exhausted only after five more full runs plus the waits between them. For a long-running job, that is hours of delay before the alarm.

Production Debug Guide

When your recovery strategy doesn't recover

Job retries but still fails after n_retrys, and the alarm fires hours later → Check the wait between attempts (this article assumes about 60 seconds). Worst-case time to alarm is (n_retrys + 1) full runs plus n_retrys waits. Lower n_retrys for permanent failures that are quick to detect, and add a separate alert on the first failure.

Box shows FAILURE but the box_terminator job didn't actually fail → Check whether any job marked box_terminator failed. autorep -J BOXNAME -q prints the JIL for the box and its jobs, so you can see which ones carry box_terminator: 1. Verify it is set on the correct job, not on an optional cleanup job.

Job running for hours, term_run_time set, but no kill → term_run_time counts minutes from the job's actual start. If the job restarted (retry), the timer reset. Check the job's actual start time, not its scheduled time.

HA failover didn't happen, primary DB down, jobs stuck → Check shadow Event Server replication lag. autoflags -a shows shadow status. If lag > heartbeat interval (default 60s), failover won't promote. Verify the tie-breaker is reachable.

Enterprise batch workflows run overnight when no one's watching. The jobs that matter most — payroll, settlement, reconciliation — are the ones where failures cost the most.

Here's the problem most teams learn the hard way: retries mask flaky scripts until they don't. Box terminators stop bad data from propagating, but only if you put them in the right place. And HA failover? 60-90 seconds feels fast until it's your 2 AM SLA.

This isn't theory. These are the patterns that actually keep workflows alive when things break.

Automatic retry with n_retrys

The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.

A hidden detail: n_retrys counts retries after the initial attempt, so n_retrys: 3 means up to 4 total runs. The wait between attempts comes from scheduler configuration rather than from the job definition. This article assumes roughly 60 seconds between attempts, so confirm the setting and its value for your installation.

Warning: alarm_if_fail fires only after ALL retries exhaust. If your job succeeds on retry 3, no alarm ever fires. This is good for transient failures but terrible for masking permanent issues.
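
To make the cost of a high n_retrys concrete, here is a minimal shell sketch of the worst-case delay before alarm_if_fail can fire. It assumes every attempt runs for its full duration before failing and that the wait between attempts is fixed. The function name and its three parameters are illustrative, not AutoSys settings.

```shell
# Worst-case minutes until the alarm fires when every attempt fails.
# Args: n_retrys, runtime per attempt (minutes), wait between attempts (seconds).
# Assumption: each attempt runs its full duration before failing.
worst_case_alarm_minutes() {
  attempts=$(( $1 + 1 ))                                 # initial run + retries
  total_sec=$(( attempts * $2 * 60 + $1 * $3 ))          # run time + waits
  echo $(( (total_sec + 59) / 60 ))                      # round up to minutes
}

worst_case_alarm_minutes 3 1 60     # prints 7
worst_case_alarm_minutes 5 20 60    # prints 125
```

A 20-minute job with n_retrys: 5 can stay quiet for over two hours, which is why the retry count matters more for long jobs than for short ones.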

retry_config.jil · BASH

insert_job: extract_market_data
job_type: CMD
command: /scripts/extract_market.sh
machine: data-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "18:00"
n_retrys: 3            /* retry up to 3 times after initial failure = 4 total attempts */
alarm_if_fail: 1       /* alarm only after all retries exhausted */
term_run_time: 45      /* kill if running over 45 minutes */
std_err_file: /logs/autosys/extract_market_data.err

/* To alert on first failure regardless of retries — use separate monitoring */
/* Add a dummy dependency job that detects failure status via autorep */

▶ Output

/* Execution sequence on failure (assuming about 60 seconds between attempts):
18:00:01 - Attempt 1: FAILURE (exit code 1)
18:01:01 - Retry 1: FAILURE (exit code 1)
18:02:01 - Retry 2: SUCCESS (exit code 0)
18:02:01 - extract_market_data: SUCCESS, downstream jobs proceed */

⚠ Retries mask root causes

If your job regularly needs retries to succeed, you don't have a fault tolerance problem — you have a bug. Use n_retrys for occasional network blips. For frequent failures, fix the script. Retries are a bandage, not a cure.

📊 Production Insight

The silent retry trap: a job with n_retrys: 5 fails every night at 2 AM. Retry 3 succeeds at 2:15 AM. The alarm never fires. Success rate in monitoring: 100%. No one knows there's a problem.

Six months later, the root cause (a misconfigured connection pool) gets worse. Now retry 5 fails at 2:30 AM. Alarm fires at 2:31 AM. Incident response at 2:45 AM. Recovery at 4 AM.

The team spent 6 months hiding from a problem they could have fixed in an afternoon.

Diagnosis: grep 'RETRY' $AUTOUSER/out/jobname.log. Count how often retries actually happen. If it's more than 1% of runs, you have a problem.

Rule: A job should succeed on its first attempt more than 99% of the time. If it doesn't, fix the root cause.
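
The 1% rule is easy to check once you have per-run attempt counts. The sketch below assumes you have already extracted one line per run in the form "run_date attempts". That input format is hypothetical, and how you produce it depends on your logs and reporting setup.

```shell
# retry_rate: percentage of runs that needed at least one retry.
# Input on stdin: one line per run, "<run_date> <attempts>" (hypothetical format).
retry_rate() {
  awk '$2 > 1 { retried++ }
       { runs++ }
       END { if (runs == 0) { print "0.0"; exit }
             printf "%.1f\n", 100 * retried / runs }'
}

retry_rate <<'EOF'
2026-03-01 1
2026-03-02 3
2026-03-03 1
2026-03-04 1
EOF
# prints 25.0
```

Anything above 1.0 from this check is a signal to open a ticket, even if every run ended in SUCCESS.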

🎯 Key Takeaway

n_retrys handles transient failure — network, not logic.

Retry count excludes the original attempt: n_retrys: 3 = up to 4 runs.

alarm_if_fail fires only after ALL retries exhaust. That's a feature and a trap.

If more than 1% of runs need a retry, fix the underlying bug.

What retry configuration should you use?

If: Job talks to an external API and occasionally times out → Use: n_retrys: 2, with about 30 seconds between attempts. Two retries cover most blips.

If: Job fails deterministically when data is missing → Use: n_retrys: 0. A retry won't help because the data is still missing. Alert immediately.

If: Job fails 10% of runs and always succeeds on retry 1 → Use: n_retrys: 1 as a bandage. You have a regression, so root-cause the flakiness.

If: Job is idempotent and expensive to re-run → Use: n_retrys: 1 max. Multiple retries waste compute. Alert on first failure.

box_terminator — stopping the box on critical failure

In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. That's usually what you want — a reporting job failing shouldn't stop the data load.

But sometimes one job's failure should stop everything. If your validation step says 'input data is corrupt', there's zero point running the 50 downstream jobs. They'll just produce garbage.

box_terminator: 1 marks the kill switch. When that job fails, AutoSys immediately terminates the entire box. All pending inner jobs skip to TERMINATED state. The box status becomes FAILURE immediately — no waiting for other jobs to finish.

box_terminator.jil · BASH

/* Box definition: define the box first so the inner job can reference it */
insert_job: eod_box
job_type: BOX
owner: batch
date_conditions: 1
days_of_week: all
start_times: "23:00"

insert_job: validate_input_data
job_type: CMD
box_name: eod_box
command: /scripts/validate.sh
machine: server01
owner: batch
box_terminator: 1          /* if this fails, the entire box fails immediately */
alarm_if_fail: 1

/* Without box_terminator: other jobs in the box would continue even after validate fails */
/* With box_terminator: box immediately moves to FAILURE, all pending inner jobs skip */

Mental Model

The Circuit Breaker Model

Think of box_terminator as a fuse: it blows once, everything stops, and someone has to investigate before resetting.

Normal failure = other jobs continue. box_terminator failure = whole box stops.
Only one job per box should be box_terminator — usually the first validation job.
Box stays in FAILURE until manually restarted or conditionally cleared.
Downstream jobs move to TERMINATED, not FAILURE. They never attempt to run.

📊 Production Insight

A bank's end-of-day box had 47 jobs. The first job validated input files. It was NOT marked box_terminator. One night, validation failed due to a missing file. The remaining 46 jobs ran anyway on partial data. The reconciliation job succeeded on the partial data — but produced incorrect settlement amounts. The error wasn't caught until customers complained three days later.

Fix: Added box_terminator: 1 to the validation job. Now when input is bad, the box stops immediately. The error is caught in the morning, not three days later.

Diagnosis: Check if your critical validation jobs are box_terminators. grep -i 'box_terminator: 1' *.jil. If not, ask: what's the harm of downstream jobs running on bad data?

Rule: Every box needs exactly one box_terminator: the job that determines whether the rest of the box should run at all.
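
The grep in the diagnosis step finds terminators but not the boxes that lack one. A rough audit can be sketched in shell with awk. It assumes JIL text with one attribute per line and reports every box that has inner jobs but no job marked box_terminator: 1. The function name and the load_box and load_step job names in the sample are made up for illustration.

```shell
# List boxes in JIL text that contain no job marked box_terminator: 1.
# Reads the named files, or stdin when no file is given.
boxes_without_terminator() {
  awk -F': *' '
    { sub(/^[ \t]+/, "") }                          # tolerate indented attributes
    $1 == "insert_job"     { box = ""; term = 0 }   # a new job definition starts
    $1 == "box_name"       { box = $2; sub(/[ \t].*$/, "", box); seen[box] = 1
                             if (term) has[box] = 1 }
    $1 == "box_terminator" { term = ($2 + 0 == 1)
                             if (term && box != "") has[box] = 1 }
    END { for (b in seen) if (!(b in has)) print b }
  ' "$@" | sort
}

boxes_without_terminator <<'EOF'
insert_job: eod_box
job_type: BOX
insert_job: validate_input_data
box_name: eod_box
box_terminator: 1
insert_job: load_box
job_type: BOX
insert_job: load_step
box_name: load_box
EOF
# prints load_box
```

Each box this prints is worth one question: which job in it decides whether the rest should run at all?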

🎯 Key Takeaway

box_terminator: 1 = kill switch for the entire box.

Use on validation, pre-reqs, and critical path jobs only.

Optional jobs should NOT be box_terminators — they'd stop the box unnecessarily.

TERMINATED ≠ FAILURE. Downstream jobs skip, not fail.

term_run_time — preventing the infinite hang

A job that runs forever is worse than a job that fails. Failing at least triggers alerts and retries. Hanging just blocks everything downstream indefinitely.

term_run_time kills a job after N minutes from its start time. The count begins when the job starts (including retries — each retry resets the timer). When term_run_time expires, AutoSys sends a SIGTERM to the agent. The agent terminates the job process and updates status to TERMINATED.

Crucial difference: TERMINATED is NOT FAILURE. Conditions like success(job) won't trigger on TERMINATED. If you want downstream jobs to run after a timeout, you need a condition that also accepts the terminated state, such as success(job) | terminated(job), or a custom wrapper script that checks exit codes.

term_run_time_example.jil · BASH

insert_job: nightly_reconcile
job_type: CMD
command: /scripts/reconcile.sh
machine: finance-server
owner: batch
date_conditions: 1
days_of_week: all
start_times: "23:00"
term_run_time: 390          /* 6.5 hours: kill if still running at 5:30 AM after an on-time start */
run_window: "23:00-05:30"   /* limits when the job may start; term_run_time does the kill */
alarm_if_fail: 1

/* On a downstream job that should run even if this one times out: */
/* condition: success(nightly_reconcile) | terminated(nightly_reconcile) */

/* Or better: point command at a wrapper script: */
/* command: /scripts/reconcile_with_timeout.sh */

💡term_run_time kills at the end of the interval

term_run_time: 390 means kill after 390 minutes have elapsed since job start. It does NOT mean 'kill at 5:30 AM'. If the job starts late, the kill time shifts accordingly.
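
A small shell sketch shows how the kill time moves with the actual start. The kill_time helper is hypothetical: it adds the term_run_time minutes to a start time in HH:MM form and wraps around midnight.

```shell
# kill_time START_HHMM MINUTES -> clock time at which term_run_time would fire.
kill_time() {
  h=${1%%:*}; m=${1##*:}
  h=${h#0}; m=${m#0}                          # drop a leading zero (avoids octal parsing)
  total=$(( (h * 60 + m + $2) % 1440 ))       # wrap around midnight
  printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

kill_time 23:00 390    # prints 05:30
kill_time 23:40 390    # prints 06:10
```

A job that starts 40 minutes late is killed 40 minutes later, so a morning deadline needs its own check and cannot rely on term_run_time alone.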

📊 Production Insight

A data warehouse load job had no term_run_time. One night, the database locked due to an uncommitted transaction from a previous job. The load job hung indefinitely — waiting on a lock that would never release. The job stayed in RUNNING for 14 hours. All downstream jobs were blocked. The morning dashboard was empty. At 9 AM, an engineer noticed, killed the job manually, and reran. SLA missed by 6 hours.

Prevention: term_run_time: 180 (3 hours). Even if the job hangs, it dies at 2 AM (assuming 11 PM start). A separate monitoring job alerts on TERMINATED status at 2:05 AM. Engineer wakes up, fixes the root cause (the uncommitted transaction), and reruns — dashboard is ready by 6 AM.

Trade-off: Too short a term_run_time kills jobs that are legitimately slow. Too long misses the SLA. Measure your job's p99 runtime and add 20% buffer.

Rule: Every job that touches external systems needs term_run_time. Default to 4 hours unless you know better.
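
The p99-plus-20% guideline can be computed from a history of runtimes. This is a minimal sketch that assumes one whole-minute runtime per line on stdin and uses a nearest-rank percentile. The function name is made up, and with only a handful of samples the p99 is simply the slowest run.

```shell
# suggest_term_run_time: reads runtimes in whole minutes (one per line) on stdin,
# prints the nearest-rank p99 plus a 20% buffer, rounded up to whole minutes.
suggest_term_run_time() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((NR * 99 + 99) / 100)          # ceil(0.99 * N) in integer math
      print int((v[rank] * 12 + 9) / 10)        # ceil(p99 * 1.2) for integer minutes
    }'
}

printf '%s\n' 100 110 95 150 105 | suggest_term_run_time    # prints 180
```

Feeding it a few months of history gives a starting value for term_run_time that you can then adjust against the SLA.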

🎯 Key Takeaway

term_run_time kills hung jobs — prevents indefinite blocking.

Timer resets on each retry. TERMINATED ≠ FAILURE.

Downstream needs explicit OR condition to handle TERMINATED.

Set based on p99 runtime + 20%, not average.

HA architecture for fault tolerance

At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.

How it works: Primary Event Server handles all writes. Shadow Event Server maintains a real-time replica via database replication (Oracle Data Guard, Sybase Replication, etc.). The Event Processor monitors the primary through heartbeat checks (default 60 seconds).

When the heartbeat fails, the Event Processor promotes the shadow to primary. Total downtime: 60-90 seconds. During this window, running jobs continue unaffected. However, no new jobs start. The Event Processor queues events during failover and processes them once the new primary is online.

Critical nuance: Replication lag is your enemy. If the shadow is 5 minutes behind when the primary fails, you lose 5 minutes of events. Those job completions, status changes, and sendevent calls are gone.
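
Because lag decides how much you lose, it helps to turn it into a number you can alert on. The sketch below assumes the lag arrives as an interval string of the form "+DD HH:MM:SS", which is how Oracle Data Guard typically reports apply lag. Other databases will need a different parser, and both function names are illustrative.

```shell
# lag_seconds "+DD HH:MM:SS" -> total seconds of replication lag.
lag_seconds() {
  printf '%s\n' "$1" |
    awk -F'[ :]' '{ sub(/^\+/, "", $1); print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }'
}

# check_lag LAG_STRING THRESHOLD_SECONDS -> prints a status line, returns 1 when over.
check_lag() {
  secs=$(lag_seconds "$1")
  if [ "$secs" -gt "$2" ]; then
    echo "CRITICAL: apply lag ${secs}s exceeds ${2}s"
    return 1
  fi
  echo "OK: apply lag ${secs}s"
}

check_lag "+00 00:00:05" 60    # prints OK: apply lag 5s
```

The non-zero return on a breach makes it easy to wire check_lag into whatever paging or monitoring wrapper you already run.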

ha_check.sh · BASH

# Check which Event Server is currently primary
autoflags -a | grep -i 'primary\|shadow\|active'

# Verify shadow is in sync
autoflags -a | grep -i 'shadow\|standby'

# Check Event Processor status (should be RUNNING on primary)
chk_auto_up -A

# Check replication lag (Oracle Data Guard example, run against the standby)
sqlplus -s autosys_user <<'SQL'
SELECT name, value FROM v$dataguard_stats WHERE name = 'apply lag';
SQL

# Manual failover (test only)
sendevent -E SWITCH_TO_SHADOW

▶ Output

AutoSys Instance: ACE
Event Server Role: PRIMARY (active)
Shadow Status: IN_SYNC
Replication Lag: 0 seconds
Event Processor: RUNNING

⚠ HA is not set-and-forget

Test failover quarterly. Monitor replication lag continuously. Document manual failover steps. Without testing, your HA setup is a fiction — you'll discover it's broken only when the primary fails.

📊 Production Insight

A retail company's primary Event Server crashed at 1 AM during the peak holiday batch window. AutoSys tried to fail over. The shadow Event Server had replication lag of 4 hours — a broken network link between DB servers hadn't been noticed. Failover completed, but 4 hours of events were missing. Job statuses were incorrect. Some jobs that had succeeded showed as RUNNING. Others that had failed showed as never attempted.

Recovery took 9 hours. Manual reconciliation of 200+ jobs. The team discovered that no one had tested failover in 18 months.

Fix: - Set up replication lag monitoring with alert at 60 seconds.

- Implement quarterly failover drills in staging.

- Add pre-failover check: autoflags -a must show 'IN_SYNC'.

- Keep documented manual reconciliation procedure for split-brain scenarios.

Trade-off: Active-passive HA wastes a database server 99.9% of the time. Active-active isn't supported. This is the cost of resilience in AutoSys.

Rule: Test HA every 90 days. If you can't, remove HA — it's giving you false confidence.

🎯 Key Takeaway

Dual Event Server HA: automatic failover in 60-90 seconds.

Replication lag kills failover — monitor it.

Running jobs continue; new jobs queue during failover.

Test quarterly. Untested HA is worse than no HA.

🗂 AutoSys Fault Tolerance Mechanisms — Side by Side

What each pattern handles and where it fails

Fault tolerance mechanism	What it handles	What it doesn't handle	Configured where
n_retrys	Transient job failures (network blips, timeouts)	Permanent failures, logic bugs	Job definition attribute
box_terminator	Critical failure that should stop the whole box	Failures in jobs that are not marked as terminators	Job definition attribute
term_run_time	Hung jobs that never complete	Slow-but-alive jobs (needs buffer)	Job definition attribute
Dual Event Server (HA)	AutoSys server/infrastructure failure	Agent failure, network partition between sites	AutoSys installation config
alarm_if_fail + notification	Human awareness and response	Automatic recovery — still needs humans	Job definition + external paging

🎯 Key Takeaways

n_retrys handles transient failures — but monitor retry rate. >1% retry means fix the root cause.
box_terminator: 1 on validation jobs stops bad data propagation immediately.
term_run_time prevents infinite hangs. Every external-facing job needs it.
HA failover takes 60-90 seconds and needs quarterly testing. Untested HA is a trap.
Retries mask problems. Alert on the first failure and on the final failure, and know the difference.
Terminated ≠ Failed. Downstream jobs need explicit OR conditions to handle timeouts.

⚠ Common Mistakes to Avoid

✕ Setting n_retrys too high (e.g., 10)

Symptom

Job fails repeatedly for 30+ minutes before final alarm. Underlying issue is permanent (missing file, bad credentials). All retries just delay the inevitable failure and push downstream jobs into the next day's window.

Fix

n_retrys: 2 for most jobs, 3 max. If a job needs more than 3 retries to succeed, you have a root cause that retries won't solve. Fix the script.

✕ Not using box_terminator on validation jobs

Symptom

Validation job fails. Downstream jobs run anyway on bad input. Output data is corrupt. The error isn't caught until downstream consumers report invalid results — often days later.

Fix

Add box_terminator: 1 to the validation job. Add alarm_if_fail: 1 so someone investigates. For optional validation jobs (warnings only), use box_terminator: 0 — they shouldn't stop the box.

✕ Treating n_retrys as a substitute for fixing flaky scripts

Symptom

Job succeeds on retry 2 most nights. Team never investigates because 'it works eventually'. Underlying issue gradually worsens. Eventually retry 5 also fails. Incident at 3 AM, but the root cause is now harder to diagnose because the code changed 6 times since the issue started.

Fix

Monitor retry rate. If any job needs a retry >1% of runs, create a ticket to investigate. n_retrys is for rare blips, not routine flakiness.

✕ Not testing HA failover

Symptom

Primary Event Server fails. AutoSys attempts failover to shadow. Shadow hasn't received replication updates for 4 hours due to broken network link. Failover completes but job state is hours out of date. Some jobs show incorrect status. Manual reconciliation takes 6+ hours.

Fix

Test failover quarterly in staging. Monitor replication lag continuously (threshold 60 seconds). Document manual failover procedure. Test both automatic and manual failover paths.

✕ No term_run_time on external-facing jobs

Symptom

Job calls external API that occasionally hangs. No timeout set. Job runs for 12+ hours, blocking downstream jobs. No one notices until morning when the dashboard is empty. Manual kill, rerun, 6-hour SLA miss.

Fix

Add term_run_time to every job that touches external systems. Set based on p99 runtime + 20% buffer. For critical path, add separate monitoring on TERMINATED status with immediate alert.

Interview Questions on This Topic

Q: How does n_retrys work in AutoSys? (Junior)

n_retrys specifies how many automatic retries AutoSys performs after a job fails. With n_retrys: 3, the job runs up to 4 times total: the original attempt plus 3 retries. The alarm (alarm_if_fail: 1) fires only after all retries are exhausted. The wait between attempts comes from scheduler configuration (this article assumes about 60 seconds). If a job succeeds on any retry, AutoSys treats it as SUCCESS and continues normally, and no alarm fires. Key nuance: n_retrys is for transient failures (network blips, temporary DB locks). It should NOT be used to mask flaky scripts. If a job regularly needs retries, fix the root cause.

Q: What is box_terminator and when would you use it? (Mid-level)

box_terminator: 1 marks a job as a kill switch for its parent box. If that job fails, AutoSys immediately terminates the entire box. All pending inner jobs move to TERMINATED status without running. The box status becomes FAILURE. Use box_terminator on jobs whose failure makes the rest of the box meaningless. Examples: data validation jobs, prerequisite checks, file existence checks, authentication jobs. Do NOT use box_terminator on optional jobs, cleanup jobs, or jobs that can fail without impacting correctness. Typically only one job per box should be the terminator: the earliest critical check.

Q: What is the difference between fault tolerance at the job level and at the infrastructure level in AutoSys? (Senior)

Job-level fault tolerance handles individual job failures: n_retrys (automatic retries), term_run_time (timeout kills), box_terminator (box-level stop). These are configured per job and recover from transient issues without human intervention. Infrastructure-level fault tolerance handles AutoSys server failures: dual Event Server HA with automatic failover. When the primary Event Server fails, the shadow promotes automatically within 60-90 seconds. Running jobs continue; new jobs queue during failover. Job-level tolerates script failures. Infrastructure-level tolerates server crashes. They're complementary: you need both for a resilient batch environment. Job-level handles daily flakiness. Infrastructure-level handles the once-a-year server failure.

Q: If a validation job fails, how do you ensure none of the downstream jobs in the box run? (Mid-level)

Two complementary approaches:
1. Use box_terminator: 1 on the validation job. When validation fails, AutoSys immediately terminates the entire box. All pending jobs (including downstream) move to TERMINATED without running. This is the cleanest approach.
2. Use condition: success(validation) on every downstream job. This works but requires maintaining the condition on potentially dozens of jobs, and jobs already running when validation fails would continue.

Best practice: box_terminator: 1 on validation + condition: success(validation) on downstream jobs that absolutely require validation. The box_terminator handles the box-level stop; conditions add an extra safety layer.

Q: How do you verify that AutoSys HA is working correctly? (Senior)

Verification steps:
1. Check replication lag: autoflags -a | grep shadow. Lag must be near zero (under 5 seconds). At the database level, query replication-specific views (V$DATAGUARD_STATS for Oracle).
2. Test failover in staging quarterly: Stop the primary Event Server (database or listener). Verify the shadow promotes automatically within the expected time (60-90 seconds). Verify the Event Processor reconnects. Verify jobs continue or resume correctly.
3. Verify tie-breaker configuration: In split-brain scenarios, the tie-breaker must be reachable from both servers. Test with network partition simulations.
4. Monitor continuously: Set up alerting on replication lag (critical at 60 seconds). Alert on any automatic failover, which should trigger an immediate post-mortem.
5. Document the manual failover procedure: Use sendevent -E SWITCH_TO_SHADOW. Test this path too.

If you can't or won't test regularly, don't run HA. Untested HA is worse than no HA: it gives false confidence and breaks in new, untested ways during actual failures.

Frequently Asked Questions

How does n_retrys work in AutoSys?

n_retrys specifies how many automatic retries AutoSys performs after a job fails. With n_retrys: 3, the job runs up to 4 times total: the original attempt plus 3 retries. The alarm only fires (if alarm_if_fail: 1) after all retries are exhausted.

Retry interval: set in scheduler configuration rather than in the job definition (this article assumes about 60 seconds). If a job succeeds on any retry, AutoSys treats it as SUCCESS and does not raise an alarm.

Important: n_retrys counts retries AFTER the first failure. n_retrys: 0 means no retries (fail once, alarm immediately).

What is box_terminator in AutoSys?

box_terminator: 1 marks a job as the kill switch for its parent box. If this job fails, AutoSys immediately terminates the box and all remaining pending inner jobs. The box status becomes FAILURE.

Use cases: Data validation jobs, prerequisite checks, file existence verification — any job whose failure makes downstream processing meaningless.

Anti-pattern: Do NOT mark optional jobs or cleanup jobs as box_terminator. Their failure should not stop the entire box.

How do I prevent downstream jobs from running after a failure?

Two approaches:

1. box_terminator: 1 on the critical upstream job. When it fails, the entire box terminates. All pending downstream jobs skip to TERMINATED.
2. condition: success(upstream_job) on each downstream job. Downstream jobs only start when the upstream succeeds, but this requires maintaining conditions on potentially many jobs.

Best practice: Use box_terminator on the first critical validation job. Use success() conditions on jobs that are downstream of the validation but within the same box for additional safety.

How do I test AutoSys HA failover?

In a staging environment:

1. Verify replication is in sync: autoflags -a (look for 'SHADOW STATUS: IN_SYNC')
2. Stop the primary Event Server (gracefully if possible): sendevent -E STOP_DEMON on the primary host, or stop the database listener.
3. Monitor autoflags -a on the shadow. It should promote within 60-90 seconds.
4. Verify Event Processor is running on the new primary: chk_auto_up -A
5. Test job execution: submit a test job and verify it runs correctly.
6. Restore the original primary and re-establish replication (procedure varies by DB).

Do this quarterly. Document every step. If you can't test automatically, test manually — but test.

Should I set n_retrys on every job?

Not necessarily. n_retrys is best for jobs that interface with external systems prone to transient failures (network services, external APIs, databases under load, cloud storage).

For jobs with deterministic inputs and outputs (data transformations, calculations, local file processing), a single failure usually warrants human investigation rather than automatic retry. The failure is likely a logic bug or missing data — retrying won't help.

General rule: Set n_retrys: 2 for external-facing jobs. Set n_retrys: 0 for purely computational jobs. Monitor retry rates regardless — if any job needs retries >1% of runs, investigate the root cause.

What's the difference between TERMINATED and FAILURE?

FAILURE: Job ran to completion but returned a non-zero exit code. AutoSys treats this as 'job tried and failed'.

TERMINATED: Job was forcibly stopped without completing, for example by term_run_time, a KILLJOB event, or a system signal. AutoSys treats this as 'job stopped without finishing'.

Key difference for dependencies: success(job) only fires on SUCCESS status. It does NOT fire on TERMINATED or FAILURE. If you want downstream jobs to run after a timeout (TERMINATED), use condition: success(upstream) | terminated(upstream).

TERMINATED is often worse than FAILURE because it's silent — no alarm fires by default unless you explicitly monitor for it.
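
The dependency behaviour above can be written down as a small truth table. The shell sketch below is a model for reasoning about conditions, not AutoSys code. It covers the success, failure, terminated and done condition keywords, where done is satisfied by any of the three finished states.

```shell
# condition_met CONDITION STATUS -> "yes" if the condition is satisfied, else "no".
# Illustrative model only; it does not query AutoSys.
condition_met() {
  case "$1:$2" in
    success:SUCCESS|failure:FAILURE|terminated:TERMINATED) echo yes ;;
    done:SUCCESS|done:FAILURE|done:TERMINATED)             echo yes ;;
    *)                                                     echo no ;;
  esac
}

condition_met success TERMINATED       # prints no
condition_met terminated TERMINATED    # prints yes
condition_met done TERMINATED          # prints yes
```

A downstream job guarded only by success() therefore waits forever after an upstream timeout, which is the silent stall this FAQ entry warns about.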

Naren Founder & Author

Developer and founder of TheCodeForge. I built this site because I was tired of tutorials that explain what to type without explaining why it works. Every article here is written to make concepts actually click.
