Senior 6 min · March 19, 2026
AutoSys Event Server and Event Processor

AutoSys n_retrys — Why Retry 4 Always Worked

A job succeeded daily while losing records.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Last updated: May 23, 2026
Quick Answer
  • n_retrys: AutoSys retries failed jobs N times. Handles network blips. Default alarm fires only after all retries exhaust.
  • box_terminator: Stops the entire box when a critical job fails. Use on validation jobs — bad input shouldn't propagate.
  • term_run_time: Hard kill after N minutes. Prevents hung jobs from blocking downstream workflows forever.
  • Dual Event Server HA: Automatic failover takes 60-90 seconds. Running jobs continue; new jobs wait.
  • The 3 AM lesson: Retries without root-cause fixes mask problems. Permanent failures still need humans.
Definition
What is AutoSys Event Server and Event Processor?

The AutoSys Event Processor is the core scheduling engine. It reads job definitions and events from the Event Server database, evaluates dependencies, and dispatches commands to remote agents. It's the brain that decides when a job should run, retry, or fail.


The n_retrys attribute controls automatic retry attempts. Setting it to 4 means the processor will re-queue the job up to 4 times after a failing exit code (5 runs in total) before marking it as permanently failed. This is critical for transient failures (network blips, resource contention) but dangerous if misapplied. In the incident described below, retry 4 always worked only because the load that triggered a race condition had subsided by then, so the job reported SUCCESS every day while the bug stayed hidden.

The box_terminator attribute stops an entire box (job container) when a critical job fails, cascading to every job in the box that has not yet run. It is essential for workflows where one failure invalidates everything downstream. term_run_time sets a maximum wall-clock runtime in minutes; if a job exceeds it, the processor kills it and marks it TERMINATED, which prevents infinite hangs. For HA, AutoSys runs a primary and a shadow Event Server as an active-passive pair. If the primary dies, the shadow takes over, typically within 60-90 seconds, and pending jobs are re-evaluated from the last state that reached the shadow.

The processor can drop events silently when the Event Server database connection is lost or when the incoming event rate exceeds what it is configured to process, which is why requests sometimes seem to vanish from the queue without logs. (max_run_alarm is not one of these cases: it only raises an alarm when a job runs long and leaves the job running.) The processor prioritizes database consistency over job completion, a design choice that forces you to monitor both the scheduler and the database separately.

Plain-English First

Fault tolerance in AutoSys is like building redundancy into your plans. If the main road is blocked (job fails), you want automatic detours (retries), emergency alerts (alarms), and a backup plan (recovery jobs). Good fault tolerance means problems get handled automatically at 3 AM without waking anyone up.

Enterprise batch workflows run overnight when no one's watching. The jobs that matter most — payroll, settlement, reconciliation — are the ones where failures cost the most.

Here's the problem most teams learn the hard way: retries mask flaky scripts until they don't. Box terminators stop bad data from propagating, but only if you put them in the right place. And HA failover? 60-90 seconds feels fast until it's your 2 AM SLA.

This isn't theory. These are the patterns that actually keep workflows alive when things break.

How the AutoSys Event Processor Actually Works

The AutoSys Event Processor is the core engine that evaluates job dependencies and triggers execution. It polls the Event Server database for new events (job completions, status changes, or timer expirations) and processes them against a dependency graph. This is a polling-based system, not a push-based one: the processor runs in a loop, querying for unprocessed events at a configurable interval (default 10 seconds).

When an event arrives, the processor checks all downstream jobs that depend on it. For each job, it evaluates the dependency condition—typically a Boolean expression of job statuses (SUCCESS, FAILURE, TERMINATED) combined with AND/OR logic. If the condition is met, the processor submits the job to the execution queue. The critical property is that the processor is single-threaded per event server instance, meaning event processing is sequential—no two events are evaluated concurrently for the same job.
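To make the dependency check concrete, here is a minimal sketch of evaluating a condition string against current job statuses. It is a toy model, not AutoSys's parser: it supports only the status functions s/f/t/d and their long names, joins terms with & and |, and evaluates strictly left to right unless parentheses say otherwise. That evaluation order is an assumption of this sketch, so use parentheses in real conditions rather than relying on it.

```python
import re

# Statuses that satisfy each condition function; done() accepts any terminal status.
ACCEPTED = {
    "s": {"SUCCESS"}, "success": {"SUCCESS"},
    "f": {"FAILURE"}, "failure": {"FAILURE"},
    "t": {"TERMINATED"}, "terminated": {"TERMINATED"},
    "d": {"SUCCESS", "FAILURE", "TERMINATED"},
    "done": {"SUCCESS", "FAILURE", "TERMINATED"},
}
TOKEN = re.compile(r"\s*(?:(\w+)\((\w+)\)|([&|()]))")


def tokenize(expr):
    """Split a condition string into status terms and operators."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"cannot parse condition at: {expr[pos:]!r}")
        if match.group(1):
            tokens.append(("term", match.group(1).lower(), match.group(2)))
        else:
            tokens.append(("op", match.group(3)))
        pos = match.end()
    return tokens


def evaluate(expr, statuses):
    """Evaluate a condition left to right against a {job: status} mapping."""
    tokens = tokenize(expr)
    pos = 0

    def operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("condition ended unexpectedly")
        token = tokens[pos]
        pos += 1
        if token[0] == "term":
            _, func, job = token
            if func not in ACCEPTED:
                raise ValueError(f"unknown condition function: {func}")
            return statuses.get(job) in ACCEPTED[func]
        if token == ("op", "("):
            value = chain()
            if pos >= len(tokens) or tokens[pos] != ("op", ")"):
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        raise ValueError(f"unexpected token: {token[1]}")

    def chain():
        nonlocal pos
        value = operand()
        while pos < len(tokens) and tokens[pos] in (("op", "&"), ("op", "|")):
            operator = tokens[pos][1]
            pos += 1
            right = operand()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = chain()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


statuses = {"etl_complete": "SUCCESS", "alert_prev": "FAILURE", "reconcile": "TERMINATED"}
print(evaluate("s(etl_complete) & f(alert_prev)", statuses))  # True
print(evaluate("s(reconcile) | t(reconcile)", statuses))      # True
```

Note that s(reconcile) on its own is False here: a TERMINATED job satisfies neither success nor failure, which is why timeouts need their own term in the condition.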

Use this processor for any AutoSys environment where job scheduling must respect inter-job dependencies. It matters because it enforces ordering guarantees without requiring custom scripting. In practice, the polling interval and database load are the primary scaling constraints—at high event rates (e.g., 100+ events/second), the processor becomes the bottleneck, and you must tune the poll frequency or shard event servers.

Polling Is Not Real-Time
The processor's default 10-second poll interval means a job can sit idle for up to 10 seconds after its dependency completes—this is not a push system.
Production Insight
A payment settlement pipeline had 500+ jobs all depending on a single 'batch complete' event. The event processor polled every 10 seconds, but the database query for unprocessed events took 8 seconds under load, causing the processor to miss its own poll cycle. Symptom: jobs remained in 'PENDING' status for 30+ seconds after the dependency succeeded, triggering SLA alerts. Rule of thumb: if your event processor's poll query takes longer than half the poll interval, you must either reduce the query complexity or increase the interval—otherwise you'll get silent event loss.
Key Takeaway
Event processing is sequential per server—don't expect parallelism for dependency resolution.
Poll interval directly controls latency: with the 10s default, a job can wait up to 10s after its dependency completes before it starts.
Database query performance on the event table is the bottleneck—index on status and event_time, or shard by job group.
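The half-interval rule of thumb from the Production Insight can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the returned keys are illustrative, and the 10-second default is the poll interval stated above.

```python
def poll_budget(query_seconds, poll_interval_seconds=10.0):
    """Apply the half-interval rule of thumb to a single poll cycle."""
    if poll_interval_seconds <= 0:
        raise ValueError("poll interval must be positive")
    ratio = query_seconds / poll_interval_seconds
    return {
        "ratio": ratio,
        # Rule of thumb: the poll query should use at most half the interval.
        "within_budget": ratio <= 0.5,
        # A query that outlasts the interval makes the processor miss a cycle.
        "overruns_cycle": query_seconds >= poll_interval_seconds,
    }


# The settlement pipeline from the incident: an 8-second query on a 10-second poll.
print(poll_budget(8.0))
```

Feeding this from measured query times (for example, a database slow-query log) gives an early warning before jobs start sitting idle after their dependencies complete.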
[Figure: Event Server vs Event Processor. Two different components, one system. Event Server: relational database; stores job definitions, all events, and calendars and globals; passive, serves data. Event Processor: always-running daemon; reads from the Event Server; evaluates conditions; sends run signals to agents; updates job status back.]

Automatic retry with n_retrys

The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.

A hidden detail: n_retrys counts retries after the initial attempt, so n_retrys: 3 means up to 4 total runs. The wait between attempts is governed by scheduler-level configuration rather than by the job definition; this article assumes roughly 60 seconds between attempts, so confirm the value on your own instance.

Warning: alarm_if_fail fires only after ALL retries exhaust. If your job succeeds on retry 3, no alarm ever fires. This is good for transient failures but terrible for masking permanent issues.
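Because the alarm waits for every retry, it helps to estimate how long a permanent failure stays invisible. The sketch below does that arithmetic. The 60-second retry interval is the article's assumed default, and attempt_seconds (how long one failing attempt runs) is a parameter you would measure yourself.

```python
def total_attempts(n_retrys):
    """n_retrys counts retries after the initial run."""
    return n_retrys + 1


def worst_case_seconds_to_alarm(n_retrys, attempt_seconds, retry_interval_seconds=60):
    """Time from first start until alarm_if_fail can fire, if every attempt fails."""
    return total_attempts(n_retrys) * attempt_seconds + n_retrys * retry_interval_seconds


print(total_attempts(3))                            # 4
print(worst_case_seconds_to_alarm(5, 300) / 60)     # 35.0 minutes
```

With n_retrys: 5 and attempts that each take five minutes to fail, the first alarm arrives 35 minutes after the job started.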

retry_config.jil (JIL)
insert_job: extract_market_data
job_type: CMD
command: /scripts/extract_market.sh
machine: data-server-01
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "18:00"
n_retrys: 3            /* retry up to 3 times after initial failure = 4 total attempts */
alarm_if_fail: 1       /* alarm only after all retries exhausted */
term_run_time: 45      /* kill if running over 45 minutes */
std_err_file: /logs/autosys/extract_market_data.err

/* To alert on first failure regardless of retries — use separate monitoring */
/* Add a dummy dependency job that detects failure status via autorep */
Output
/* Execution sequence on failure (assuming 60 seconds between attempts):
18:00:01 - Attempt 1: FAILURE (exit code 1)
18:01:01 - Retry 1: FAILURE (exit code 1)
18:02:01 - Retry 2: SUCCESS (exit code 0)
18:02:01 - extract_market_data: SUCCESS, downstream jobs proceed */
Retries mask root causes
If your job regularly needs retries to succeed, you don't have a fault tolerance problem — you have a bug. Use n_retrys for occasional network blips. For frequent failures, fix the script. Retries are a bandage, not a cure.
Production Insight
The silent retry trap: a job with n_retrys: 5 fails every night at 2 AM. Retry 3 succeeds at 2:15 AM. The alarm never fires. Success rate in monitoring: 100%. No one knows there's a problem.
Six months later, the root cause (a misconfigured connection pool) gets worse. Now retry 5 fails at 2:30 AM. Alarm fires at 2:31 AM. Incident response at 2:45 AM. Recovery at 4 AM.
The team spent 6 months hiding from a problem they could have fixed in an afternoon.
Diagnosis: grep 'RETRY' $AUTOUSER/out/jobname.log. Count how often retries actually happen. If more than 1% of runs need a retry, you have a problem.
Rule: a job should succeed on its first attempt in more than 99% of runs. If it doesn't, fix the root cause.
Key Takeaway
n_retrys handles transient failure — network, not logic.
Retry count excludes the original attempt: n_retrys: 3 = up to 4 runs.
alarm_if_fail fires only after ALL retries exhaust. That's a feature and a trap.
If more than 1% of runs need a retry, fix the underlying bug.
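The 1% rule is easy to automate once you have the number of attempts each run took. The sketch below assumes you have already extracted that list from your logs or job history; how you collect it is outside its scope, and the function names are illustrative.

```python
def retry_rate(attempts_per_run):
    """Fraction of runs that needed at least one retry."""
    if not attempts_per_run:
        raise ValueError("no runs to analyse")
    retried = sum(1 for attempts in attempts_per_run if attempts > 1)
    return retried / len(attempts_per_run)


def needs_root_cause_ticket(attempts_per_run, threshold=0.01):
    """True when more than `threshold` of runs needed a retry."""
    return retry_rate(attempts_per_run) > threshold


history = [1] * 197 + [2, 3, 2]          # 200 runs, 3 of which retried
print(retry_rate(history))               # 0.015
print(needs_root_cause_ticket(history))  # True
```

A job that quietly fails and retries every night shows a retry rate of 1.0 here even though its final status is always SUCCESS, which is exactly the signal that final-status monitoring misses.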
What retry configuration should you use?
If: Job talks to an external API with occasional timeouts
Use: n_retrys: 2, retry_interval: 30. Two retries cover most blips.
If: Job fails deterministically when data is missing
Use: n_retrys: 0. Retry won't help: the data is still missing. Alert immediately.
If: Job fails 10% of runs, always succeeds on retry 1
Use: You have a regression. n_retrys: 1 as a bandage, but root-cause the flakiness.
If: Job is idempotent and expensive to re-run
Use: n_retrys: 1 max. Multiple retries waste compute. Alert on first failure.

box_terminator — stopping the box on critical failure

In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. That's usually what you want — a reporting job failing shouldn't stop the data load.

But sometimes one job's failure should stop everything. If your validation step says 'input data is corrupt', there's zero point running the 50 downstream jobs. They'll just produce garbage.

box_terminator: 1 marks the kill switch. When that job fails, AutoSys immediately terminates the entire box. All pending inner jobs skip to TERMINATED state. The box status becomes FAILURE immediately — no waiting for other jobs to finish.

box_terminator.jil (JIL)
/* Box definition: define the box first so inner jobs can reference it */
insert_job: eod_box
job_type: BOX
owner: batch
date_conditions: 1
days_of_week: all
start_times: "23:00"

insert_job: validate_input_data
job_type: CMD
box_name: eod_box
command: /scripts/validate.sh
machine: server01
owner: batch
box_terminator: 1          /* if this fails, the entire box fails immediately */
alarm_if_fail: 1

/* Without box_terminator: other jobs in the box would continue even after validate fails */
/* With box_terminator: box immediately moves to FAILURE, all pending inner jobs skip */
The Circuit Breaker Model
  • Normal failure = other jobs continue. box_terminator failure = whole box stops.
  • Only one job per box should be box_terminator — usually the first validation job.
  • Box stays in FAILURE until manually restarted or conditionally cleared.
  • Downstream jobs move to TERMINATED, not FAILURE. They never attempt to run.
Production Insight
A bank's end-of-day box had 47 jobs. The first job validated input files. It was NOT marked box_terminator. One night, validation failed due to a missing file. The remaining 46 jobs ran anyway on partial data. The reconciliation job succeeded on the partial data — but produced incorrect settlement amounts. The error wasn't caught until customers complained three days later.
Fix: Added box_terminator: 1 to the validation job. Now when input is bad, the box stops immediately. The error is caught in the morning, not three days later.
Diagnosis: Check if your critical validation jobs are box_terminators. grep -i 'box_terminator: 1' *.jil. If not, ask: what's the harm of downstream jobs running on bad data?
Rule: Every box needs exactly one box_terminator: the job that determines whether the rest of the box should run at all.
Key Takeaway
box_terminator: 1 = kill switch for the entire box.
Use on validation, pre-reqs, and critical path jobs only.
Optional jobs should NOT be box_terminators — they'd stop the box unnecessarily.
TERMINATED ≠ FAILURE. Downstream jobs skip, not fail.
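The grep in the diagnosis step tells you where box_terminator appears, not which boxes lack one. A small lint can answer the second question. This is a minimal sketch that assumes one attribute per line and /* */ comments, as in the listings in this article; real JIL allows several attributes on one line, which this parser does not handle. The reports_box and build_report names in the sample are hypothetical.

```python
import re

COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)
ATTRIBUTE = re.compile(r"^\s*(\w+)\s*:\s*(.*?)\s*$")


def parse_jil(text):
    """Parse JIL text into one dict per insert_job block (one attribute per line)."""
    jobs = []
    for line in COMMENT.sub("", text).splitlines():
        match = ATTRIBUTE.match(line)
        if match is None:
            continue
        key, value = match.groups()
        if key == "insert_job":
            jobs.append({"insert_job": value})
        elif jobs:
            jobs[-1][key] = value
    return jobs


def boxes_without_terminator(jobs):
    """Return names of boxes in which no inner job sets box_terminator."""
    boxes = {j["insert_job"] for j in jobs if j.get("job_type", "").lower() in ("b", "box")}
    guarded = {
        j["box_name"]
        for j in jobs
        if "box_name" in j and j.get("box_terminator", "0").lower() in ("1", "y")
    }
    return sorted(boxes - guarded)


SAMPLE = """
insert_job: eod_box
job_type: BOX

insert_job: validate_input_data
job_type: CMD
box_name: eod_box
box_terminator: 1   /* kill switch */

insert_job: reports_box   /* hypothetical second box */
job_type: BOX

insert_job: build_report
job_type: CMD
box_name: reports_box
"""

print(boxes_without_terminator(parse_jil(SAMPLE)))  # ['reports_box']
```

A box showing up in this list is not automatically a defect. It is a prompt to ask whether anything in that box should stop the rest from running.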

term_run_time — preventing the infinite hang

A job that runs forever is worse than a job that fails. Failing at least triggers alerts and retries. Hanging just blocks everything downstream indefinitely.

term_run_time kills a job after N minutes from its start time. The count begins when the job starts (including retries — each retry resets the timer). When term_run_time expires, AutoSys sends a SIGTERM to the agent. The agent terminates the job process and updates status to TERMINATED.

Crucial difference: TERMINATED is NOT FAILURE. Conditions like success(job) and failure(job) won't trigger on TERMINATED. If you want downstream jobs to run after a timeout, add terminated(job) to the condition, use done(job) (which is satisfied by SUCCESS, FAILURE, or TERMINATED), or use a custom wrapper script that checks exit codes.

term_run_time_example.jil (JIL)
insert_job: nightly_reconcile
job_type: CMD
command: /scripts/reconcile.sh
machine: finance-server
owner: batch
date_conditions: 1
days_of_week: all
start_times: "23:00"
term_run_time: 390          /* 6.5 hours: killed at 05:30 if it started at 23:00 */
run_window: "23:00-05:30"   /* limits when the job may start; term_run_time does the kill */
alarm_if_fail: 1

/* On a downstream job that should run even if this one times out: */
/*   condition: success(nightly_reconcile) | terminated(nightly_reconcile) */

/* Or handle the timeout inside a wrapper script instead: */
/*   command: /scripts/reconcile_with_timeout.sh */
term_run_time kills at the end of the interval
term_run_time: 390 means kill after 390 minutes have elapsed since job start. It does NOT mean 'kill at 5:30 AM'. If the job starts late, the kill time shifts accordingly.
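A quick calculation shows how the kill time moves with the actual start. This is only date arithmetic on the values from the example above; the dates themselves are arbitrary.

```python
from datetime import datetime, timedelta


def kill_time(actual_start, term_run_time_minutes):
    """term_run_time counts minutes from the actual start, not the scheduled one."""
    return actual_start + timedelta(minutes=term_run_time_minutes)


on_time = kill_time(datetime(2026, 3, 19, 23, 0), 390)
late = kill_time(datetime(2026, 3, 19, 23, 40), 390)
print(on_time)  # 2026-03-20 05:30:00
print(late)     # 2026-03-20 06:10:00
```

A 40-minute late start pushes the kill from 05:30 to 06:10, so an SLA pinned to a wall-clock time needs its own check rather than relying on term_run_time alone.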
Production Insight
A data warehouse load job had no term_run_time. One night, the database locked due to an uncommitted transaction from a previous job. The load job hung indefinitely — waiting on a lock that would never release. The job stayed in RUNNING for 14 hours. All downstream jobs were blocked. The morning dashboard was empty. At 9 AM, an engineer noticed, killed the job manually, and reran. SLA missed by 6 hours.
Prevention: term_run_time: 180 (3 hours). Even if the job hangs, it dies at 2 AM (assuming 11 PM start). A separate monitoring job alerts on TERMINATED status at 2:05 AM. Engineer wakes up, fixes the root cause (the uncommitted transaction), and reruns — dashboard is ready by 6 AM.
Trade-off: Too short a term_run_time kills jobs that are legitimately slow. Too long misses the SLA. Measure your job's p99 runtime and add 20% buffer.
Rule: Every job that touches external systems needs term_run_time. Default to 4 hours unless you know better.
Key Takeaway
term_run_time kills hung jobs — prevents indefinite blocking.
Timer resets on each retry. TERMINATED ≠ FAILURE.
Downstream needs explicit OR condition to handle TERMINATED.
Set based on p99 runtime + 20%, not average.
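The "p99 plus 20%" guideline can be computed from a list of historical runtimes. The sketch below uses the nearest-rank percentile definition, which is one of several reasonable choices, and rounds the result up to whole minutes. The runtime list is synthetic.

```python
import math


def p99_minutes(runtimes):
    """Nearest-rank 99th percentile of historical runtimes, in minutes."""
    if not runtimes:
        raise ValueError("need at least one historical runtime")
    ordered = sorted(runtimes)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]


def suggested_term_run_time(runtimes, buffer_percent=20):
    """p99 runtime plus a percentage buffer, rounded up to whole minutes."""
    return math.ceil(p99_minutes(runtimes) * (100 + buffer_percent) / 100)


history = list(range(1, 101))            # synthetic runtimes: 1 to 100 minutes
print(p99_minutes(history))              # 99
print(suggested_term_run_time(history))  # 119
```

Using the average instead would set the limit far too low for this distribution and kill runs that are slow but healthy.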

HA architecture for fault tolerance

At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.

How it works: Primary Event Server handles all writes. Shadow Event Server maintains a real-time replica via database replication (Oracle Data Guard, Sybase Replication, etc.). The Event Processor monitors the primary through heartbeat checks (default 60 seconds).

When the heartbeat fails, the Event Processor promotes the shadow to primary. Total downtime: 60-90 seconds. During this window, running jobs continue unaffected. However, no new jobs start. The Event Processor queues events during failover and processes them once the new primary is online.

Critical nuance: Replication lag is your enemy. If the shadow is 5 minutes behind when the primary fails, you lose 5 minutes of events. Those job completions, status changes, and sendevent calls are gone.

ha_check.sh (Bash)
# Check which Event Server is currently primary
autoflags -a | grep -i 'primary\|shadow\|active'

# Verify shadow is in sync
autoflags -a | grep -i 'shadow\|standby'

# Check Event Processor status (should be RUNNING on primary)
chk_auto_up -A

# Check replication lag (Oracle Data Guard example; sqlplus prompts for the password)
# The quoted heredoc delimiter stops the shell from expanding v$dataguard_stats
sqlplus -s autosys_user <<'SQL'
SELECT name, value FROM v$dataguard_stats WHERE name = 'apply lag';
SQL

# Manual failover (test only)
sendevent -E SWITCH_TO_SHADOW
Output
AutoSys Instance: ACE
Event Server Role: PRIMARY (active)
Shadow Status: IN_SYNC
Replication Lag: 0 seconds
Event Processor: RUNNING
HA is not set-and-forget
Test failover quarterly. Monitor replication lag continuously. Document manual failover steps. Without testing, your HA setup is a fiction — you'll discover it's broken only when the primary fails.
Production Insight
A retail company's primary Event Server crashed at 1 AM during the peak holiday batch window. AutoSys tried to fail over. The shadow Event Server had replication lag of 4 hours — a broken network link between DB servers hadn't been noticed. Failover completed, but 4 hours of events were missing. Job statuses were incorrect. Some jobs that had succeeded showed as RUNNING. Others that had failed showed as never attempted.
Recovery took 9 hours. Manual reconciliation of 200+ jobs. The team discovered that no one had tested failover in 18 months.
Fix:
- Set up replication lag monitoring with alert at 60 seconds.
- Implement quarterly failover drills in staging.
- Add pre-failover check: autoflags -a must show 'IN_SYNC'.
- Keep documented manual reconciliation procedure for split-brain scenarios.
Trade-off: Active-passive HA wastes a database server 99.9% of the time. Active-active isn't supported. This is the cost of resilience in AutoSys.
Rule: Test HA every 90 days. If you can't, remove HA — it's giving you false confidence.
Key Takeaway
Dual Event Server HA: automatic failover in 60-90 seconds.
Replication lag kills failover — monitor it.
Running jobs continue; new jobs queue during failover.
Test quarterly. Untested HA is worse than no HA.
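The pre-failover check from the fix list can be expressed as a small guard. This sketch assumes you have already obtained the shadow status string and the replication lag in seconds from your own monitoring; the function name and the 60-second default mirror the alert threshold suggested above and are not AutoSys settings.

```python
def failover_check(shadow_status, lag_seconds, max_lag_seconds=60):
    """Return (ok, reason) for a planned or automated failover decision."""
    if shadow_status != "IN_SYNC":
        return False, f"shadow reports {shadow_status}, expected IN_SYNC"
    if lag_seconds > max_lag_seconds:
        return False, f"replication lag {lag_seconds}s exceeds {max_lag_seconds}s"
    return True, "shadow is in sync and within the lag budget"


print(failover_check("IN_SYNC", 0))
print(failover_check("IN_SYNC", 4 * 3600))  # the 4-hour lag from the retail incident
```

Running a check like this on a schedule, not just at failover time, is what would have surfaced the broken replication link months earlier.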

How the Event Processor Drops Jobs (and Why You Should Care)

The Event Processor doesn't process every event. It drops some of them. Not randomly: by design. When a job definition changes mid-flight, or a manual event comes in while the scheduler is crunching, the processor throttles. It uses a priority queue and a sliding window. If the event rate exceeds the configured max_events_per_second, the processor starts ignoring lower-priority events. That means your 'run now' request might get silently eaten if the system is busy processing a batch of 2000 nightly jobs.

You check syslog. Nothing. You check the event server logs. Silence. You start blaming the network. Stop. Check event_svr_config. The default max_events_per_second is 500. A burst of 2000 job-start events needs at least 4 seconds to drain at that rate, and anything that arrives above the ceiling in the meantime gets dropped without warning. No error code. No log entry. Just gone.

The fix is straightforward: raise the limit based on your workload, or — better — implement a client-side backoff in your event submission scripts. Never assume the processor took your event. Always verify with autosyslog or event_ack.
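A client-side backoff with verification might look like the sketch below. The submit and verify callables are placeholders you would supply yourself (for example, wrappers around whatever command sends the event and whatever check confirms it was accepted); nothing here calls AutoSys directly. Delays double on each unconfirmed attempt.

```python
import time


def submit_with_backoff(submit, verify, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Submit an event, confirm acceptance, and back off exponentially between tries.

    Returns the attempt number on which the event was confirmed.
    """
    for attempt in range(1, max_attempts + 1):
        submit()
        if verify():
            return attempt
        if attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    raise RuntimeError(f"event not confirmed after {max_attempts} attempts")


# Demo with stand-in callables: the event is only confirmed on the third try.
outcomes = iter([False, False, True])
delays = []
attempt = submit_with_backoff(
    submit=lambda: None,
    verify=lambda: next(outcomes),
    sleep=delays.append,
)
print(attempt, delays)  # 3 [1.0, 2.0]
```

Resubmitting is only safe if the event is idempotent. If the first submission was accepted but the verification lagged, a second submission could start the job twice, so make verify as reliable as you can before raising max_attempts.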

EventProcessorThrottleCheck.yml (YAML)
# io.thecodeforge: devops tutorial
#
# Verify the current event processing rate limit
# and identify dropped events.

command:
  check_event_processor:
    executable: event_svr_config
    arguments:
      - --show
      - max_events_per_second
    expected_output: 500

log_analysis:
  target: /var/log/autosys/EventProcessor.log
  search_string: "dropping event"
  time_window: "last 15 minutes"
  action: alert_on_match

remediation:
  if event_rate > 450:
    increase_max_events_per_second:
      new_value: 1000
      restart_service: false
      apply_immediately: true
Output
max_events_per_second: 500
dropped events in last 15 minutes: 23
Most frequent source: MANUAL_JOB_START from user 'deploy_bot'
Recommended: Increase limit to 1000 or implement client retry.
Production Trap:
Silent drops are the number one cause of 'it works on my machine' incidents. Always validate event acceptance with event_ack or the event server's response code. Don't trust the client exit code.
Key Takeaway
Events are dropped silently when the processor's rate limit is hit. Expect no log entry. Assume nothing. Verify everything.

The Hidden Cost of Global Event Filters

Global event filters look like a clean solution. You set a filter to ignore all events from a retired application. Problem disappears. Except now the processor is still parsing every event, matching it against the filter, and discarding it. That's CPU time. That's I/O. That's the event queue growing while your processor burns cycles on discards.

Every event goes through the full pipeline: socket read, parse, authenticate, filter match, priority assign, queue insert. A filter at the end doesn't skip the first three steps. With a global filter matching 30% of events, you're wasting 30% of your processor's throughput on work that produces nothing. Over a 24-hour window, that translates to minutes of latency for legitimate events.

The smarter play: use network-layer filtering or application-level segregation. Block events at the source. If you must use global filters, measure the filter-match CPU time with event_svr_config --stats. If it's above 5% of total CPU, you're burning resources. Drop them upstream.

GlobalFilterCostAnalysis.yml (YAML)
# io.thecodeforge: devops tutorial
#
# Measure the cost of global event filters
# and recommend removal if wasteful.

statistics:
  collection_interval: 300  # seconds
  metrics:
    - name: event_processor.cpu.time_filter_match
      unit: percent
      threshold: 5
    - name: event_processor.events.filtered_total
      unit: count
    - name: event_processor.events.processed_total
      unit: count

analysis:
  formula: (events_filtered_total / events_processed_total) * 100
  action_on_exceed:
    if percentage > 20:
      recommendation: "Move filtering upstream to application layer. Global filters waste CPU."
      commands:
        - stop: event_svr
        - remove: global_filter_list
        - start: event_svr
Output
CPU time spent on filter matching: 7.2%
Total events processed: 15,342
Total events filtered: 5,021 (32.7%)
Recommendation: Move filtering upstream. Global filters consuming 7.2% CPU are unacceptable.
Senior Shortcut:
Don't use event_svr_config --global-filter as your first line of defense. Use network ACLs or separate event server instances for different trust zones. Save the global filter for emergency-only temp blocks.
Key Takeaway
Global event filters consume CPU on every event, not just matched ones. If filter-match CPU exceeds 5%, you're paying for work that does nothing.
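The two thresholds in this section (5% filter-match CPU, 20% of events filtered) reduce to a short calculation. The sketch below takes the counters as plain numbers; where they come from on your system is not something it assumes, and the function names are illustrative.

```python
def filtered_percent(filtered_total, processed_total):
    """Share of processed events that a global filter discarded, as a percentage."""
    if processed_total <= 0:
        raise ValueError("processed_total must be positive")
    return filtered_total / processed_total * 100


def filter_verdict(filtered_total, processed_total, filter_cpu_percent,
                   cpu_limit=5.0, filtered_limit=20.0):
    """Flag a global filter as wasteful if either threshold from the text is exceeded."""
    share = filtered_percent(filtered_total, processed_total)
    wasteful = filter_cpu_percent > cpu_limit or share > filtered_limit
    return {"filtered_percent": round(share, 1), "move_upstream": wasteful}


# Figures from the sample output above: 5,021 of 15,342 events filtered, 7.2% CPU.
print(filter_verdict(5021, 15342, 7.2))
# {'filtered_percent': 32.7, 'move_upstream': True}
```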

Event Processor Syntax: The Declarative Contract

The job definitions that the Event Processor acts on are written in JIL (Job Information Language), a declarative attribute: value syntax. It is not YAML. Every job definition starts with insert_job: followed by a unique job name. The job_type: field accepts c, b, or f (command, box, file watcher); the long forms CMD and BOX appear in the listings earlier in this article. Critical attributes include command: for the executable path, machine: for the target host, and owner: for execution credentials. Control flow is handled via condition: expressions built from status functions (s(jobA) for success, f(jobB) for failure, t(jobC) for terminated) joined with & (AND) and | (OR). Child jobs join a box through their own box_name: attribute. Set retry and timeout attributes such as n_retrys and term_run_time on the individual job rather than expecting them to be inherited from a parent box. The alarm_if_fail: flag (1 or y) triggers alerts. date_conditions: 1 switches on date and time scheduling, and start_times: takes 24-hour values such as "23:00". Load definitions with the jil command and read its output on every import: it rejects malformed definitions, such as unknown attributes or broken conditions, with an error message rather than failing later at run time.

syntax_example.jil (JIL)
/* io.thecodeforge: devops tutorial */
/* Syntax: valid job with conditions */
insert_job: data_pipeline_job
job_type: c
command: /app/run_pipeline.sh
machine: prod-db-01
owner: deployer
n_retrys: 2
term_run_time: 300
condition: s(etl_complete) & f(alert_prev)
alarm_if_fail: y
description: "Extract-transform-load step"
Output
Job registered. Condition 's(etl_complete)' will block until upstream job completes.
Production Trap:
A missing machine: attribute causes the Event Processor to default to 'localhost', which silently fails on remote deployments. Always validate your JIL before importing.
Key Takeaway
Declare job_type:, command:, and machine: explicitly on every command job. Relying on defaults is how a job ends up pointed at the wrong host.

Real-World Examples: Retry, Termination & HA in Action

A common production pattern pairs n_retrys: 3 with term_run_time: 10 to handle transient failures. Example: a file ingestion job that retries on network blips but kills any attempt that runs longer than 10 minutes (term_run_time is measured in minutes). The Event Processor logs each retry attempt in $AUTOUSER/events.log with a JOB_RETRY event.

When box_terminator: y is set on a job inside a box, that job's final failure terminates the parent box, including siblings that are already running, not just ones that have yet to start.

For HA, use the dual Event Server configuration described earlier: a primary and a shadow kept in sync, with failover driven by the Event Processor's heartbeat checks. Failover can only resume from the state that has reached the shadow, so replication lag sets the limit on how much survives. Global event filters reduce noise, but as covered above they cost CPU on every event, so treat them as a temporary measure.

ha_recovery.jil (JIL)
/* io.thecodeforge: devops tutorial */
/* Box with a terminator job, retries, and a failure notifier */
insert_job: payment_box
job_type: b
alarm_if_fail: y
description: "Payment processing group"

insert_job: process_transaction
job_type: c
box_name: payment_box
command: /app/process.sh
machine: pay-app-01
n_retrys: 3
term_run_time: 15      /* minutes per attempt */
box_terminator: y      /* final failure terminates payment_box */

/* Defined outside the box so it still runs after the box terminates */
insert_job: notify_failure
job_type: c
command: /app/alert.sh
machine: pay-app-01
condition: f(process_transaction)
Output
Box starts. If process_transaction fails, notify_failure runs instantly; all other siblings terminate.
Operations Insight:
Always pair n_retrys with term_run_time. Without a timeout, a hung attempt never fails, so the retry logic never gets a chance to run and the stuck process keeps consuming resources (and CPU credits in cloud environments).
Key Takeaway
Condition cascading with s() and f() operators, combined with box_terminator, creates deterministic failure isolation for critical pipelines.
Production incident · Post-mortem · Severity: high

The Retry That Masked a 6-Month-Old Bug

Symptom
Job shows SUCCESS in autorep daily. Downstream data is correct — mostly. But occasionally, a record is missing. The missing records follow a pattern: always from the first 10 minutes of the extraction window.
Assumption
The team assumed the job was stable because it always ended in SUCCESS. They didn't know it was failing 4 times before succeeding. The retries masked a race condition in the extract script.
Root cause
n_retrys: 5 with alarm_if_fail: 1 — the alarm only fires after all retries exhaust. Since the 4th retry succeeded, the alarm never triggered. The job's success hid the failures from monitoring. The script had a race condition: it deleted a temp table before committing the final write. On a quiet system, the timing worked. Under load, the table disappeared mid-read, causing an error. Retry 4 always worked because by then the load had subsided.
Fix
- Changed n_retrys from 5 to 2. Excessive retries mask real problems.
- Added monitoring on job attempt count, not just final status.
- Fixed the actual bug: commit-before-drop ordering in the script.
- Added alert for 'any failure' on critical jobs, regardless of retry exhaustion.
Key lesson
  • n_retrys masks failures. Monitor attempt counts, not just final status.
  • If a job regularly needs retries, you have a root cause — fix it, don't retry it.
  • alarm_if_fail fires only after all retries. Use a separate alert for first failure.
  • 5 retries exhaust only after 5 retry intervals plus the runtime of all 6 attempts. For a long-running job, that can mean hours of delay before the alarm.
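The ordering fix from this incident (commit the final write, then drop the temp table) can be sketched with Python's built-in sqlite3 module. This is an illustration of the ordering only. The table names are made up, the original script's database is not stated in this article, and a single in-memory connection cannot reproduce the load-dependent race itself.

```python
import sqlite3


def load_records(conn, rows):
    """Stage rows in a temp table, copy them to the final table, then clean up."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS final_records (id INTEGER PRIMARY KEY, payload TEXT)")
    cur.execute("CREATE TEMP TABLE staging (id INTEGER, payload TEXT)")
    cur.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    cur.execute("INSERT INTO final_records SELECT id, payload FROM staging")
    conn.commit()                      # 1. make the final write durable first
    cur.execute("DROP TABLE staging")  # 2. only then remove the temp table
    conn.commit()
    return cur.execute("SELECT COUNT(*) FROM final_records").fetchone()[0]


conn = sqlite3.connect(":memory:")
print(load_records(conn, [(1, "a"), (2, "b")]))  # 2
```

The point is the order of the two numbered steps: nothing that still reads from the staging table should be in flight when it is dropped.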
Production debug guide: when your recovery strategy doesn't recover
Symptom · 01
Job retries but still fails after n_retrys — alarm fires hours later
Fix
Check the retry interval (default 60s). Time to alarm is at least n_retrys * retry_interval plus the runtime of every attempt. Lower n_retrys for permanent failures that are quick to detect. Add a separate alert that fires on the first failure.
Symptom · 02
Box shows FAILURE but box_terminator job didn't actually fail
Fix
Check whether any job marked box_terminator failed. autorep -J BOXNAME -q prints the definitions of the box and its inner jobs, including any box_terminator setting. Verify box_terminator: 1 is set on the correct job, not on an optional cleanup job.
Symptom · 03
Job running for hours, term_run_time set, but no kill
Fix
term_run_time kills only at end of term_run_time minutes from job start. If job restarted (retry), timer resets. Check job's actual start time, not scheduled time.
Symptom · 04
HA failover didn't happen — primary DB down, jobs stuck
Fix
Check shadow Event Server replication lag. autoflags -a shows shadow status. If lag > heartbeat interval (default 60s), failover won't promote. Verify tie-breaker reachable.
Fault Tolerance: 60-Second Diagnosis
When jobs retry too much, not enough, or boxes break silently
Job retrying too many times
Immediate action
Check n_retrys value and recent failures
Commands
autorep -J JOBNAME -q | grep n_retrys
autorep -J JOBNAME -d | grep FAILURE
Fix now
update_job: JOBNAME n_retrys: 2
Box failed but no clear reason
Immediate action
Find which job failed and check box_terminator
Commands
autorep -J BOXNAME -d | grep 'FAILURE\|TERMINATED'
autorep -J BOXNAME -q | grep box_terminator
Fix now
update_job: VALIDATION_JOB box_terminator: 1
Job hung, not terminating
Immediate action
Check term_run_time and current runtime
Commands
autorep -J JOBNAME -q | grep term_run_time
date; autorep -J JOBNAME   # compare the current time with the Last Start column
Fix now
sendevent -E KILLJOB -J JOBNAME
HA not failing over
Immediate action
Check Event Server roles and lag
Commands
autoflags -a | grep -E 'Primary|Shadow|Active'
tail -50 $AUTOUSER/out/event_demon.* | grep -i failover
Fix now
sendevent -E SWITCH_TO_SHADOW (manual failover if shadow in sync)
AutoSys Fault Tolerance Mechanisms — Side by Side
| Fault tolerance mechanism | What it handles | What it doesn't handle | Configured where |
| --- | --- | --- | --- |
| n_retrys | Transient job failures (network blips, timeouts) | Permanent failures, logic bugs | Job definition attribute |
| box_terminator | Critical failure that should stop the whole box | Multiple failures (only one job can be terminator) | Job definition attribute |
| term_run_time | Hung jobs that never complete | Slow-but-alive jobs (needs buffer) | Job definition attribute |
| Dual Event Server (HA) | AutoSys server/infrastructure failure | Agent failure, network partition between sites | AutoSys installation config |
| alarm_if_fail + notification | Human awareness and response | Automatic recovery (still needs humans) | Job definition + external paging |

Key takeaways

1
n_retrys handles transient failures, but monitor the retry rate. A retry rate above 1% means fix the root cause.
2
box_terminator: 1 on validation jobs stops bad data propagation immediately.
3
term_run_time prevents infinite hangs. Every external-facing job needs it.
4
HA failover takes 60-90 seconds and needs quarterly testing. Untested HA is a trap.
5
Retries mask problems. Alert on the first failure AND the final failure, and know the difference.
6
Terminated ≠ Failed. Downstream jobs need explicit OR conditions to handle timeouts.
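Takeaway 6 can be sketched in JIL. A follow-up job conditioned only on `failure(...)` does not start when the upstream run ends as TERMINATED, so the condition has to name both states. The job names `LOAD_JOB` and `LOAD_CLEANUP` and the helper function are hypothetical; the output is plain text until it is piped into `jil`.

```shell
#!/bin/sh
# Hypothetical helper: print JIL for a follow-up job that runs when the
# upstream job either fails or is terminated (for example by term_run_time).
# Usage: on_fail_or_term_jil DOWNSTREAM_JOB UPSTREAM_JOB
on_fail_or_term_jil() {
  printf 'update_job: %s\n' "$1"
  printf 'condition: failure(%s) | terminated(%s)\n' "$2" "$2"
}

# Preview only; apply with: on_fail_or_term_jil LOAD_CLEANUP LOAD_JOB | jil
on_fail_or_term_jil LOAD_CLEANUP LOAD_JOB
```

If the follow-up should run after any completed state, `done(...)` is the broader alternative, at the cost of also firing on success.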

Common mistakes to avoid

5 patterns
×

Setting n_retrys too high (e.g., 10)

Symptom
Job fails repeatedly for 30+ minutes before final alarm. Underlying issue is permanent (missing file, bad credentials). All retries just delay the inevitable failure and push downstream jobs into the next day's window.
Fix
n_retrys: 2 for most jobs, 3 max. If a job needs more than 3 retries to succeed, you have a root cause that retries won't solve. Fix the script.
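A quick back-of-the-envelope calculation shows how much a high n_retrys delays the final alarm. The sketch assumes a fixed run time per attempt and a fixed gap between attempts; both numbers are placeholders for your own measurements, not AutoSys defaults.

```shell
#!/bin/sh
# Worst-case minutes before the final alarm when every attempt fails.
# Usage: minutes_to_alarm N_RETRYS RUN_MINUTES GAP_MINUTES
minutes_to_alarm() {
  n_retrys=$1; run=$2; gap=$3
  attempts=$((n_retrys + 1))              # original run plus retries
  echo $((attempts * run + n_retrys * gap))
}

minutes_to_alarm 10 3 1   # 11 attempts * 3 min + 10 gaps * 1 min = 43
minutes_to_alarm 2 3 1    # 3 attempts * 3 min + 2 gaps * 1 min = 11
```

With the same job, dropping from 10 retries to 2 brings the alarm forward by roughly half an hour in this example.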
×

Not using box_terminator on validation jobs

Symptom
Validation job fails. Downstream jobs run anyway on bad input. Output data is corrupt. The error isn't caught until downstream consumers report invalid results — often days later.
Fix
Add box_terminator: 1 to the validation job. Add alarm_if_fail: 1 so someone investigates. For optional validation jobs (warnings only), use box_terminator: 0 — they shouldn't stop the box.
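One way to audit a box for this mistake is to scan its JIL for jobs that never set `box_terminator: 1`. The sketch below assumes the JIL dump (for example from `autorep -J BOXNAME -q`) starts each job with an `insert_job:` line and puts `box_terminator` on its own line; the sample job names are hypothetical. The box job itself will also be listed, since it carries no terminator flag.

```shell
#!/bin/sh
# List jobs in a JIL dump that do not set box_terminator: 1.
# Reads JIL text on stdin (for example from: autorep -J BOXNAME -q).
jobs_without_terminator() {
  awk '
    $1 == "insert_job:" {
      if (job != "" && !flag) print job   # flush the previous job
      job = $2; flag = 0
    }
    $1 == "box_terminator:" && $2 == "1" { flag = 1 }
    END { if (job != "" && !flag) print job }
  '
}

# Hypothetical sample; real input would come from autorep.
printf '%s\n' \
  'insert_job: VALIDATE_INPUT   job_type: CMD' \
  'box_terminator: 1' \
  'insert_job: LOAD_DATA   job_type: CMD' \
  'alarm_if_fail: 1' | jobs_without_terminator
```

Review the list by hand: only validation-style jobs should gain the flag, not every job that appears.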
×

Treating n_retrys as a substitute for fixing flaky scripts

Symptom
Job succeeds on retry 2 most nights. The team never investigates because 'it works eventually'. The underlying issue gradually worsens until every retry fails. The incident hits at 3 AM, and the root cause is now harder to diagnose because the code has changed 6 times since the issue started.
Fix
Monitor retry rate. If any job needs a retry >1% of runs, create a ticket to investigate. n_retrys is for rare blips, not routine flakiness.
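The 1% rule is easy to automate once you have two counts per job: runs that needed at least one retry, and total runs. How you collect those counts is up to your own run history; the function below is a hypothetical sketch that only does the comparison.

```shell
#!/bin/sh
# Flag a job whose retry rate exceeds a threshold (default 1%).
# Usage: retry_rate_exceeded RUNS_WITH_RETRY TOTAL_RUNS [THRESHOLD_PERCENT]
# Exit status 0 means "over threshold, open a ticket".
retry_rate_exceeded() {
  awk -v r="$1" -v t="$2" -v th="${3:-1}" 'BEGIN {
    if (t <= 0) exit 1                 # no runs, nothing to flag
    exit !((r / t) * 100 > th)
  }'
}

if retry_rate_exceeded 4 250; then     # 4 of 250 runs = 1.6%
  echo "investigate: retry rate above 1%"
fi
```

Using the exit status rather than printed output makes the check easy to drop into a cron job or a monitoring wrapper.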
×

Not testing HA failover

Symptom
Primary Event Server fails. AutoSys attempts failover to shadow. Shadow hasn't received replication updates for 4 hours due to broken network link. Failover completes but job state is hours out of date. Some jobs show incorrect status. Manual reconciliation takes 6+ hours.
Fix
Test failover quarterly in staging. Monitor replication lag continuously (threshold 60 seconds). Document manual failover procedure. Test both automatic and manual failover paths.
×

No term_run_time on external-facing jobs

Symptom
Job calls external API that occasionally hangs. No timeout set. Job runs for 12+ hours, blocking downstream jobs. No one notices until morning when the dashboard is empty. Manual kill, rerun, 6-hour SLA miss.
Fix
Add term_run_time to every job that touches external systems. Set it to the p99 runtime plus a 20% buffer, in minutes. For critical-path jobs, add separate monitoring on TERMINATED status with an immediate alert.
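The "p99 plus 20%" rule can be computed from a list of past runtimes. This sketch assumes one runtime in whole minutes per line on stdin, uses the nearest-rank method for the percentile, and rounds the result up; where the runtime history comes from is left to you.

```shell
#!/bin/sh
# Suggest term_run_time (whole minutes) as p99 runtime plus a 20% buffer.
# Reads one runtime in whole minutes per line on stdin.
suggest_term_run_time() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                    # no history, no suggestion
      idx = int((NR * 99 + 99) / 100)        # nearest-rank p99: ceil(0.99 * NR)
      print int((v[idx] * 120 + 99) / 100)   # add 20%, then round up
    }'
}

# Hypothetical history: 100 runs with runtimes of 1..100 minutes.
awk 'BEGIN { for (i = 1; i <= 100; i++) print i }' | suggest_term_run_time
```

For the sample history the p99 is 99 minutes, so the suggestion is 119. Recompute periodically, since a job that slowly gets slower will otherwise start hitting the limit.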
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR
How does n_retrys work in AutoSys?
Q02 · SENIOR
What is box_terminator and when would you use it?
Q03 · SENIOR
What is the difference between fault tolerance at the job level and at t...
Q04 · SENIOR
If a validation job fails, how do you ensure none of the downstream jobs...
Q05 · SENIOR
How do you verify that AutoSys HA is working correctly?
Q01 of 05 · JUNIOR

How does n_retrys work in AutoSys?

ANSWER
n_retrys specifies how many automatic retries AutoSys performs after a job fails. With n_retrys: 3, the job runs up to 4 times total: the original attempt plus 3 retries. The alarm (alarm_if_fail: 1) fires only after all retries are exhausted. Retry timing is controlled by the profile setting max_exit (default 60 seconds between attempts). If a job succeeds on any retry, AutoSys treats it as SUCCESS and continues normally — no alarm fires. Key nuance: n_retrys is for transient failures (network blips, temporary DB locks). It should NOT be used to mask flaky scripts. If a job regularly needs retries, fix the root cause.
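The "original attempt plus N retries" counting in this answer can be mimicked outside AutoSys with a small loop. This is a simulation of the behaviour described above, not how the Event Processor is implemented; `run_with_retries` is a hypothetical helper and it ignores any wait between attempts.

```shell
#!/bin/sh
# Simulate the n_retrys behaviour described above, outside AutoSys:
# run a command up to 1 + N times and report only the final outcome.
# Usage: run_with_retries N command [args...]
run_with_retries() {
  n=$1; shift
  attempt=1
  while [ "$attempt" -le $((n + 1)) ]; do
    if "$@"; then
      echo "SUCCESS on attempt $attempt"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "FAILURE after $((n + 1)) attempts (alarm would fire here)"
  return 1
}

run_with_retries 3 false || true   # always fails: 4 attempts in total
run_with_retries 3 true            # succeeds on the first attempt
```

Note that the success message is the only trace of earlier failed attempts, which is exactly why retried runs need their own monitoring.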
FAQ · 6 QUESTIONS

Frequently Asked Questions

01
How does n_retrys work in AutoSys?
02
What is box_terminator in AutoSys?
03
How do I prevent downstream jobs from running after a failure?
04
How do I test AutoSys HA failover?
05
Should I set n_retrys on every job?
06
What's the difference between TERMINATED and FAILURE?