AutoSys n_retrys — Why Retry 4 Always Worked
A job succeeded daily while losing records.
20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.
- n_retrys: AutoSys retries failed jobs N times. Handles network blips. Default alarm fires only after all retries exhaust.
- box_terminator: Stops the entire box when a critical job fails. Use on validation jobs — bad input shouldn't propagate.
- term_run_time: Hard kill after N minutes. Prevents hung jobs from blocking downstream workflows forever.
- Dual Event Server HA: Automatic failover takes 60-90 seconds. Running jobs continue; new jobs wait.
- The 3 AM lesson: Retries without root-cause fixes mask problems. Permanent failures still need humans.
Fault tolerance in AutoSys is like building redundancy into your plans. If the main road is blocked (job fails), you want automatic detours (retries), emergency alerts (alarms), and a backup plan (recovery jobs). Good fault tolerance means problems get handled automatically at 3 AM without waking anyone up.
Enterprise batch workflows run overnight when no one's watching. The jobs that matter most — payroll, settlement, reconciliation — are the ones where failures cost the most.
Here's the problem most teams learn the hard way: retries mask flaky scripts until they don't. Box terminators stop bad data from propagating, but only if you put them in the right place. And HA failover? 60-90 seconds feels fast until it's your 2 AM SLA.
This isn't theory. These are the patterns that actually keep workflows alive when things break.
How the AutoSys Event Processor Actually Works
The AutoSys Event Processor is the core engine that evaluates job dependencies and triggers execution. It polls the Event Server database for new events (job completions, status changes, or timer expirations) and processes them against a dependency graph. This is a polling-based system, not a push-based one: the processor runs in a loop, querying for unprocessed events at a configurable interval (default 10 seconds).
When an event arrives, the processor checks all downstream jobs that depend on it. For each job, it evaluates the dependency condition—typically a Boolean expression of job statuses (SUCCESS, FAILURE, TERMINATED) combined with AND/OR logic. If the condition is met, the processor submits the job to the execution queue. The critical property is that the processor is single-threaded per event server instance, meaning event processing is sequential—no two events are evaluated concurrently for the same job.
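The evaluation step described above can be sketched as a toy model. This is not AutoSys code: the tuple-based condition format and the names `evaluate` and `process_events` are assumptions made for illustration. It only shows events being applied strictly one at a time, with each dependent job's Boolean condition re-checked after every event.

```python
def evaluate(cond, statuses):
    """Evaluate a nested condition against the current job statuses.

    A condition is ("status", job_name, expected_status) or
    ("and" | "or", left_condition, right_condition).
    """
    op = cond[0]
    if op == "status":
        _, job, expected = cond
        return statuses.get(job) == expected
    if op == "and":
        return evaluate(cond[1], statuses) and evaluate(cond[2], statuses)
    if op == "or":
        return evaluate(cond[1], statuses) or evaluate(cond[2], statuses)
    raise ValueError(f"unknown operator: {op!r}")


def process_events(events, dependents, statuses):
    """Apply events sequentially and return the jobs whose conditions became true."""
    started = []
    for job, status in events:
        statuses[job] = status  # one event at a time, never concurrently
        for name, cond in dependents.items():
            if name not in started and evaluate(cond, statuses):
                started.append(name)
    return started
```

With a dependent job conditioned on two upstream successes, the first SUCCESS event starts nothing and the second one releases the job.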
Use this processor for any AutoSys environment where job scheduling must respect inter-job dependencies. It matters because it enforces ordering guarantees without requiring custom scripting. In practice, the polling interval and database load are the primary scaling constraints—at high event rates (e.g., 100+ events/second), the processor becomes the bottleneck, and you must tune the poll frequency or shard event servers.
Automatic retry with n_retrys
The simplest fault tolerance mechanism. n_retrys tells AutoSys to automatically rerun a failed job N times before declaring it a final FAILURE. This handles transient failures like brief network blips or temporary database connection issues.
A hidden detail: n_retrys counts retries after the initial attempt, so n_retrys: 3 means up to 4 total runs. The retry interval is set in the scheduler configuration, not in the job's JIL, and is usually about 60 seconds between attempts.
Warning: alarm_if_fail fires only after ALL retries exhaust. If your job succeeds on retry 3, no alarm ever fires. This is good for transient failures but terrible for masking permanent issues.
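To make the counting and the alarm behaviour concrete, here is a minimal simulation. It is a model of the rules stated above, not of the scheduler itself; `run_with_retries`, `RunResult`, and `flaky` are hypothetical names introduced for the sketch.

```python
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    status: str        # "SUCCESS" or "FAILURE"
    attempts: int      # total runs, including the initial attempt
    alarm_fired: bool  # models alarm_if_fail


def run_with_retries(job: Callable[[], bool], n_retrys: int) -> RunResult:
    """Run job once, then retry up to n_retrys more times."""
    max_attempts = n_retrys + 1
    for attempt in range(1, max_attempts + 1):
        if job():
            # Success on any attempt: no alarm, earlier failures stay invisible.
            return RunResult("SUCCESS", attempt, False)
    return RunResult("FAILURE", max_attempts, True)


def flaky(fail_times: int) -> Callable[[], bool]:
    """Build a job that fails fail_times times and then succeeds."""
    remaining = {"n": fail_times}

    def job() -> bool:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return False
        return True

    return job
```

A job that fails three times under n_retrys: 3 ends as SUCCESS on its fourth run with no alarm, which is exactly the case where the final status hides the problem. The `attempts` field is the number worth monitoring.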
box_terminator — stopping the box on critical failure
In a BOX with multiple independent jobs, a failure in one job normally leaves other jobs to continue. That's usually what you want — a reporting job failing shouldn't stop the data load.
But sometimes one job's failure should stop everything. If your validation step says 'input data is corrupt', there's zero point running the 50 downstream jobs. They'll just produce garbage.
box_terminator: 1 marks the kill switch. When that job fails, AutoSys immediately terminates the entire box. All pending inner jobs skip to TERMINATED state. The box status becomes FAILURE immediately — no waiting for other jobs to finish.
- Normal failure = other jobs continue. box_terminator failure = whole box stops.
- Only one job per box should be box_terminator — usually the first validation job.
- Box stays in FAILURE until manually restarted or conditionally cleared.
- Downstream jobs move to TERMINATED, not FAILURE. They never attempt to run.
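The difference between a normal failure and a box_terminator failure can be modelled in a few lines. This is a simplified sketch that follows the rules listed above: jobs are run in list order rather than by dependency, and `run_box` is a hypothetical helper, not an AutoSys API.

```python
def run_box(jobs, failing, terminators):
    """Model a box run.

    jobs:        job names in execution order
    failing:     set of job names that fail when run
    terminators: set of job names defined with box_terminator
    Returns (box_status, {job_name: status}).
    """
    statuses = {}
    box_status = "SUCCESS"
    box_killed = False
    for name in jobs:
        if box_killed:
            # Never attempted: the box was already stopped.
            statuses[name] = "TERMINATED"
            continue
        if name in failing:
            statuses[name] = "FAILURE"
            box_status = "FAILURE"
            if name in terminators:
                box_killed = True
        else:
            statuses[name] = "SUCCESS"
    return box_status, statuses
```

When the validation job is the terminator and fails, every later job ends as TERMINATED. When an ordinary job fails, the remaining jobs still run.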
term_run_time — preventing the infinite hang
A job that runs forever is worse than a job that fails. Failing at least triggers alerts and retries. Hanging just blocks everything downstream indefinitely.
term_run_time kills a job after N minutes from its start time. The count begins when the job starts (including retries — each retry resets the timer). When term_run_time expires, AutoSys sends a SIGTERM to the agent. The agent terminates the job process and updates status to TERMINATED.
Crucial difference: TERMINATED is NOT FAILURE. Conditions like success(job) won't trigger on TERMINATED, and neither will failure(job). If you want downstream jobs to run after a timeout, condition them on terminated(job) or done(job), or use a custom wrapper script that checks exit codes.
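One possible shape for such a wrapper is sketched below. The idea, which is an assumption of this example rather than anything AutoSys prescribes, is to enforce a timeout inside the wrapper that is shorter than term_run_time, so an overrun surfaces as an ordinary non-zero exit code instead of a TERMINATED status. The exit code 124 is borrowed from the GNU timeout convention.

```python
import subprocess
import sys

TIMEOUT_EXIT_CODE = 124  # assumption: same value GNU timeout uses for an overrun


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd and return its exit code, or TIMEOUT_EXIT_CODE if it overran."""
    try:
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return TIMEOUT_EXIT_CODE
    return completed.returncode


# Typical use from a script entry point (the command path is hypothetical):
#   sys.exit(run_with_timeout(["/opt/app/ingest.sh"], timeout_seconds=540))
```

Because the wrapper exits on its own, the job ends in FAILURE with a recognisable code, and downstream conditions written against failure or exit codes keep working.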
HA architecture for fault tolerance
At the infrastructure level, AutoSys supports high availability through the dual Event Server architecture. For mission-critical batch environments, this is non-negotiable.
How it works: Primary Event Server handles all writes. Shadow Event Server maintains a real-time replica via database replication (Oracle Data Guard, Sybase Replication, etc.). The Event Processor monitors the primary through heartbeat checks (default 60 seconds).
When the heartbeat fails, the Event Processor promotes the shadow to primary. Total downtime: 60-90 seconds. During this window, running jobs continue unaffected. However, no new jobs start. The Event Processor queues events during failover and processes them once the new primary is online.
Critical nuance: Replication lag is your enemy. If the shadow is 5 minutes behind when the primary fails, you lose 5 minutes of events. Those job completions, status changes, and sendevent calls are gone.
How the Event Processor Drops Jobs (and Why You Should Care)
The Event Server doesn't process every event. It drops them. Not randomly — by design. When a job definition changes mid-flight, or a manual event comes in while the scheduler is crunching, the processor throttles. It uses a priority queue and a sliding window. If the event rate exceeds the configured max_events_per_second, the processor starts ignoring lower-priority events. That means your 'run now' request might get silently eaten if the system is busy processing a batch of 2000 nightly jobs.
You check syslog. Nothing. You check the event server logs. Silence. You start blaming the network. Stop. Check event_svr_config. The default max_events_per_second is 500. If you're running 2000 jobs with staggered starts, you're hitting that ceiling every 4 seconds. Events get dropped without warning. No error code. No log entry. Just gone.
The fix is straightforward: raise the limit based on your workload, or — better — implement a client-side backoff in your event submission scripts. Never assume the processor took your event. Always verify with autosyslog or event_ack.
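A client-side backoff with verification might look like the sketch below. `submit` and `verify` are hypothetical callables supplied by the caller, for example a function that shells out to sendevent and one that checks the job status afterwards; nothing here calls AutoSys directly.

```python
import time


def submit_with_backoff(submit, verify, max_attempts=5, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep):
    """Submit an event and confirm it landed, backing off between attempts.

    submit(): sends the event (hypothetical, caller-supplied).
    verify(): returns True once the event is visible to the scheduler.
    Returns the number of submissions that were needed.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        submit()
        sleep(delay)  # give the processor time before checking
        if verify():
            return attempt
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
    raise RuntimeError(f"event not confirmed after {max_attempts} attempts")
```

The wait happens before each verification so that a slow but successful submission is not resent immediately. If resending an event is not safe for a given job, only the verification step should be retried.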
The Hidden Cost of Global Event Filters
Global event filters look like a clean solution. You set a filter to ignore all events from a retired application. Problem disappears. Except now the processor is still parsing every event, matching it against the filter, and discarding it. That's CPU time. That's I/O. That's the event queue growing while your processor burns cycles on discards.
Every event goes through the full pipeline: socket read, parse, authenticate, filter match, priority assign, queue insert. A filter at the end doesn't skip the first three steps. With a global filter matching 30% of events, you're wasting 30% of your processor's throughput on work that produces nothing. Over a 24-hour window, that translates to minutes of latency for legitimate events.
The smarter play: use network-layer filtering or application-level segregation. Block events at the source. If you must use global filters, measure the filter-match CPU time with event_svr_config --stats. If it's above 5% of total CPU, you're burning resources. Drop them upstream.
Event Processor Syntax: The Declarative Contract
AutoSys jobs are defined in JIL (Job Information Language), a declarative key: value syntax (not YAML) that covers job behavior, retry policies, and box termination rules. Every job definition must start with insert_job: followed by a unique job name. The job_type: field accepts c, b, or f (command, box, file watcher). Critical attributes include command: for executable paths, machine: for target hosts, and owner: for execution credentials. Control flow is handled via condition: expressions using logical operators (s(jobA) for success, f(jobB) for failure). Box jobs group child jobs with a box_name: attribute; global settings like n_retrys and term_run_time cascade from parent boxes. The alarm_if_fail: boolean triggers alerts. All timestamps use 24-hour format in date_conditions: blocks. This strict syntax prevents silent failures: a missing job_type: or a malformed condition causes the Event Processor to reject the job entirely, logging a clear rejection reason to $AUTOUSER/events.log.
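A pre-import check for the required attributes can be sketched as a small lint. This is an illustration only: it assumes one attribute per line, encodes just the rules stated in this section (job_type on every job, command and machine on command jobs), and `parse_jil` and `lint` are hypothetical helpers, not part of any AutoSys tooling.

```python
def parse_jil(text):
    """Parse a simplified JIL file into a list of attribute dictionaries."""
    jobs = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("/*"):
            continue  # skip blank lines and comment lines
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key: value line
        key, value = key.strip(), value.strip()
        if key == "insert_job":
            current = {"insert_job": value}
            jobs.append(current)
        elif current is not None:
            current[key] = value
    return jobs


def lint(jobs):
    """Return a list of human-readable problems found in the parsed jobs."""
    problems = []
    for job in jobs:
        name = job["insert_job"]
        job_type = job.get("job_type")
        if job_type is None:
            problems.append(f"{name}: missing job_type")
        elif job_type == "c":
            for attr in ("command", "machine"):
                if attr not in job:
                    problems.append(f"{name}: missing {attr}")
    return problems
```

Running the lint in CI before loading definitions turns a rejected or misrouted job into a failed build instead of a surprise at run time.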
Omitting the machine: attribute causes the Event Processor to default to 'localhost', which silently fails on remote deployments. Always validate your JIL before importing.
Define job_type:, command:, and machine: on every job; omitting even one causes the job to be rejected.
Real-World Examples: Retry, Termination & HA in Action
A common production pattern uses n_retrys: 3 and term_run_time: 10 to handle transient failures (term_run_time is expressed in minutes). Example: a file ingestion job that retries on network blips but is killed after 10 minutes. The Event Processor logs each retry attempt in $AUTOUSER/events.log with a JOB_RETRY event.
When a child job carries box_terminator: y, its failure terminates the box, including siblings that are already running, not just those yet to start. Setting alarm_if_fail: y on the same job raises the alarm at that moment.
For HA, deploy a minimum of 3 Event Servers behind a load balancer. Each server shares an autosys_ha_file: on NFS (autosys_ha_file: /shared/autosys/ha_state). During failover, the secondary Event Server reads the last processed event from the shared file, resuming exactly where the primary left off.
Global event filters (%include for job prefixes, %exclude for machines) reduce noise but accelerate database fragmentation; rotate the $AUTOSYS/events.db weekly to maintain performance.
Always pair n_retrys with term_run_time; otherwise a stuck process can retry indefinitely, exhausting CPU credits in cloud environments.
Using the s() and f() operators, combined with box_terminator, creates deterministic failure isolation for critical pipelines.
The Retry That Masked a 6-Month-Old Bug
- n_retrys masks failures. Monitor attempt counts, not just final status.
- If a job regularly needs retries, you have a root cause — fix it, don't retry it.
- alarm_if_fail fires only after all retries. Use a separate alert for first failure.
- Five retries take at least 5 * retry_interval to exhaust, plus the runtime of all six attempts. For long-running jobs, that's hours of delay before the alarm.
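The delay arithmetic and the attempt-count check from the list above can be sketched as two small helpers. Both function names and the input format (a mapping of job name to observed attempt count, however it is collected) are assumptions for illustration.

```python
def worst_case_alarm_delay(n_retrys, run_seconds, retry_interval_seconds):
    """Seconds from first start until alarm_if_fail can fire, if every attempt fails."""
    attempts = n_retrys + 1
    return attempts * run_seconds + n_retrys * retry_interval_seconds


def jobs_needing_retries(attempts_by_job, threshold=1):
    """Return the jobs whose attempt count exceeds threshold, sorted by name."""
    return sorted(job for job, attempts in attempts_by_job.items()
                  if attempts > threshold)
```

For example, `worst_case_alarm_delay(5, 1800, 60)` models a 30-minute job with five retries and a 60-second interval: about three hours pass before anyone is alerted. Feeding nightly attempt counts into `jobs_needing_retries` gives a list of jobs that succeeded only because of retries.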
autorep -J JOBNAME -q | grep n_retrys
autorep -J JOBNAME -L 20 | grep FAILURE
Common mistakes to avoid
- Setting n_retrys too high (e.g., 10)
- Not using box_terminator on validation jobs
- Treating n_retrys as a substitute for fixing flaky scripts
- Not testing HA failover
- No term_run_time on external-facing jobs
Interview Questions on This Topic
How does n_retrys work in AutoSys?