Pre-check jobs with box_terminator stop doomed runs early; post-checks validate output
Parallel execution uses box jobs with condition success(prev_box) and separate dependency chains
Error handling chains combine alarm_if_fail, n_retrys, and notification to catch failures before they escalate
Plain-English First
AutoSys patterns are like recipes that experienced batch architects have discovered work well in production. This article shares the ones that actually matter — naming conventions that save debugging time, orchestration patterns that handle failures gracefully, and operational habits that keep large environments manageable.
Working well with AutoSys means understanding not just the syntax but also the patterns experienced architects use to build batch workflows that run reliably for years. These practices separate a well-run AutoSys environment from one where every incident is a fire drill.
Naming conventions — the difference between sane and unmanageable
In a large AutoSys environment with thousands of jobs, naming conventions are everything. A consistent, searchable naming convention means you can find any job in seconds and understand its purpose without documentation.
Your naming convention should encode the environment
Putting PRD/QAT/DEV at the start of every name makes it much harder to submit jobs to the wrong environment by accident. When you run autorep -J PRD_% you know you're looking at production. This simple prefix prevents incidents.
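As a rough illustration, here is how a few definitions might look under an ENVIRONMENT_SYSTEM_FUNCTION_FREQUENCY pattern, with a _BOX suffix for boxes. The job names, script path, machine names, and owner are hypothetical; adapt the segments to your own systems.

```jil
/* Hypothetical names following ENVIRONMENT_SYSTEM_FUNCTION_FREQUENCY */

/* Box: environment, system, function, then a _BOX suffix */
insert_job: PRD_TRADING_EXTRACT_BOX
job_type: BOX
description: "Production trading extract section"

/* Job: environment, system, function, frequency */
insert_job: PRD_TRADING_EXTRACT_DAILY
job_type: CMD
box_name: PRD_TRADING_EXTRACT_BOX
command: /scripts/extract_trades.sh
machine: etl-server-01
owner: batchuser

/* The QAT copy of the same job differs only in its prefix and machine */
insert_job: QAT_TRADING_EXTRACT_DAILY
job_type: CMD
command: /scripts/extract_trades.sh
machine: etl-qat-01
owner: batchuser
```

With names like these, a query such as autorep -J PRD_TRADING_% returns every production trading job in one call.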
Production Insight
Teams that skip naming conventions spend hours every month searching for jobs.
The worst case: two jobs named 'daily_extract' in different systems. AutoSys shows both, and you pick the wrong one.
Rule: enforce naming conventions with a script that rejects new jobs not matching the pattern.
Key Takeaway
Name every job like someone will grep for it in 3 years.
Make the environment prefix non-negotiable.
Bad naming is technical debt that compounds with every new job.
When to enforce naming conventions
If: Environment has fewer than 50 jobs → Use: Conventions are helpful but not critical; you can still navigate manually.
If: Environment has more than 200 jobs → Use: Mandatory conventions. Use a git hook to reject JIL that doesn't match the pattern.
If: Multiple teams submit jobs → Use: Start with a simple <TEAM>_<SYSTEM>_... prefix to avoid collisions.
[Figure: EOD Batch Best Practice Pattern (Autosys Real World Patterns, thecodeforge.io)]
The standard EOD orchestration pattern
The standard pattern for end-of-day batch is a three-level hierarchy: master box → section boxes → job chains. This gives you visibility at multiple levels and makes partial failure recovery clean.
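A minimal sketch of the three levels might look like the following. All job names, script paths, the machine name, and the schedule are assumptions chosen for illustration, and only two sections are shown.

```jil
/* Level 1: master box, owns the schedule (hypothetical names throughout) */
insert_job: PRD_TRADING_EOD_BOX
job_type: BOX
date_conditions: 1
days_of_week: mo,tu,we,th,fr
start_times: "21:00"

/* Level 2: section boxes inside the master, chained on the previous section */
insert_job: PRD_TRADING_EXTRACT_BOX
job_type: BOX
box_name: PRD_TRADING_EOD_BOX

insert_job: PRD_TRADING_TRANSFORM_BOX
job_type: BOX
box_name: PRD_TRADING_EOD_BOX
condition: success(PRD_TRADING_EXTRACT_BOX)

/* Level 3: job chains inside each section box */
insert_job: PRD_TRADING_EXTRACT_DAILY
job_type: CMD
box_name: PRD_TRADING_EXTRACT_BOX
command: /scripts/extract_trades.sh
machine: etl-server-01
owner: batchuser
alarm_if_fail: 1

insert_job: PRD_TRADING_TRANSFORM_DAILY
job_type: CMD
box_name: PRD_TRADING_TRANSFORM_BOX
command: /scripts/transform_trades.sh
machine: etl-server-01
owner: batchuser
alarm_if_fail: 1
```

Because the section boxes depend on each other rather than on individual jobs, a failure in the extract section leaves the transform section waiting, and rerunning the extract box is enough to let the rest of the run continue.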
The 3-level pattern saved a trading team when the extract box failed at 22:30.
They only needed to rerun the EXTRACT section, not the entire EOD.
Master box success depends on all sections, but failed sections can be restarted independently.
Key Takeaway
3-level box hierarchy isolates failures to a section, not the whole batch.
Restart becomes surgical: fix and rerun only the broken box.
This pattern scales to hundreds of jobs without chaos.
When to use 3-level vs simpler structure
If: Fewer than 10 jobs, no dependency between groups → Use: Single flat box with conditions is sufficient.
If: 10-50 jobs with logical phases → Use: 3-level hierarchy for clear failure isolation.
If: Over 50 jobs, multiple teams own different phases → Use: Further nest section boxes for each team's workload.
Always include a pre-check and post-check job
Professional batch workflows include a pre-check job (validates environment/inputs before starting) and a post-check job (validates outputs after completion). These save enormous debugging time.
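One way this could be wired up is sketched below, assuming a small box with a single processing job. The box, job, script, and machine names are hypothetical. Note that the processing job needs an explicit condition on the pre-check; without it, jobs in a box start together when the box starts.

```jil
/* Hypothetical box holding a pre-check, one processing job, and a post-check */
insert_job: PRD_RISK_EOD_BOX
job_type: BOX
date_conditions: 1
days_of_week: mo,tu,we,th,fr
start_times: "21:00"

/* Pre-check: no condition, so it starts with the box.
   box_terminator: 1 terminates the box if this job fails. */
insert_job: PRD_RISK_PRECHECK_DAILY
job_type: CMD
box_name: PRD_RISK_EOD_BOX
command: /scripts/precheck_inputs.sh
machine: etl-server-01
owner: batchuser
box_terminator: 1
alarm_if_fail: 1

/* Processing waits for the pre-check to succeed */
insert_job: PRD_RISK_LOAD_DAILY
job_type: CMD
box_name: PRD_RISK_EOD_BOX
command: /scripts/load_positions.sh
machine: etl-server-01
owner: batchuser
condition: success(PRD_RISK_PRECHECK_DAILY)
alarm_if_fail: 1

/* Post-check: validates the output once processing has succeeded */
insert_job: PRD_RISK_POSTCHECK_DAILY
job_type: CMD
box_name: PRD_RISK_EOD_BOX
command: /scripts/postcheck_outputs.sh
machine: etl-server-01
owner: batchuser
condition: success(PRD_RISK_LOAD_DAILY)
alarm_if_fail: 1
```

If other jobs in the box can already be running when the pre-check fails, consider whether they also need job_terminator: 1 so that they stop along with the box.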
Parallel execution: fork and join with section boxes
AutoSys runs jobs inside a box in parallel by default, as long as they have no conditions on each other. But you need to be intentional: use separate section boxes with no dependency between them for truly parallel work, or use condition statements to fork and join. The key is to avoid overwhelming the Event Server with hundreds of simultaneous conditions.
Pattern: Create a parent box, then inside it, define multiple section boxes that have no inter-dependency. Each section box runs its jobs in parallel. Use a final section box that depends on all parallel boxes (using condition: success(PARALLEL_BOX_1) & success(PARALLEL_BOX_2)) to join the execution.
parallel_pattern.jil
/* Master box that orchestrates parallel work */
insert_job: PRD_EOD_PARALLEL_MASTER
job_type: BOX
date_conditions: 1
days_of_week: all
start_times: "22:00"

/* Parallel section boxes: no condition between them, so both start with the master */
insert_job: PRD_REPORT_A_BOX
job_type: BOX
box_name: PRD_EOD_PARALLEL_MASTER

insert_job: PRD_REPORT_B_BOX
job_type: BOX
box_name: PRD_EOD_PARALLEL_MASTER

/* Inside box A: generate the report, then email it once generation succeeds */
insert_job: PRD_REPORT_A_GEN
job_type: CMD
box_name: PRD_REPORT_A_BOX
command: /scripts/gen_report_a.sh
machine: rep-server-01
alarm_if_fail: 1

insert_job: PRD_REPORT_A_EMAIL
job_type: CMD
box_name: PRD_REPORT_A_BOX
command: /scripts/email_report_a.sh
machine: rep-server-01
condition: success(PRD_REPORT_A_GEN)
alarm_if_fail: 1

/* Inside box B: at least one job is needed, since a box with no jobs stays in RUNNING.
   This job and its script are placeholders mirroring box A. */
insert_job: PRD_REPORT_B_GEN
job_type: CMD
box_name: PRD_REPORT_B_BOX
command: /scripts/gen_report_b.sh
machine: rep-server-01
alarm_if_fail: 1

/* Join box that runs after both parallel sections complete */
insert_job: PRD_EOD_JOIN_BOX
job_type: BOX
box_name: PRD_EOD_PARALLEL_MASTER
condition: success(PRD_REPORT_A_BOX) & success(PRD_REPORT_B_BOX)

insert_job: PRD_EOD_FINALIZE
job_type: CMD
box_name: PRD_EOD_JOIN_BOX
command: /scripts/eod_finalize.sh
machine: etl-server-01
alarm_if_fail: 1
Parallel execution mental model
Boxes with no condition on each other execute in parallel
Use & (AND) condition on a join box to wait for all parallel streams
Avoid putting hundreds of jobs in one flat box — they'll still be parallel but become unmanageable
Alarm on failures inside parallel boxes individually, not at the join box
Production Insight
Parallel execution cut a night batch window from 6 hours to 2.5 hours.
But the first attempt overwhelmed the Event Server with 200 simultaneous conditions, and we hit AutoSys's internal condition queue limit.
Fix: limit parallel fan-out to no more than 10-15 independent branches.
Key Takeaway
Parallel execution is where AutoSys shines and fails hardest.
Keep fan-out under 15 branches to avoid Event Server bottlenecks.
Always join parallel streams with a clean condition — don't rely on box completion.
When to use parallel execution
If: Jobs are independent and run on different machines → Use: Parallel execution reduces wall-clock time significantly.
If: Jobs share a single database or file system → Use: Be careful, because parallel I/O can cause contention. Test with staged parallelism.
If: You need strict ordering after parallel work → Use: A join box with a compound condition to synchronize.
Error handling chains — catching failures before they cascade
A well-designed AutoSys environment uses a layered error handling chain: immediate retry (n_retrys), job-level alarm (alarm_if_fail), box-level escalation, and finally notification to operations. Don't just set 'alarm_if_fail: 1' and hope. Design the chain so that transient failures auto-recover, permanent failures trigger alerts, and critical failures page a human.
Pattern: For I/O jobs on external systems, set n_retrys: 2 with a short interval. For validation jobs, set alarm_if_fail: 1 and make them box_terminator. For business-critical workflows, add a notification job that runs condition: failure(job_name).
error_chain.jil
/* Job that calls an external HTTP API: transient failures are common */
insert_job: PRD_TRADING_FETCH_RATES
job_type: CMD
command: /scripts/fetch_exchange_rates.sh
machine: api-server-01
owner: batchuser
/* max_run_alarm is in minutes: alert if the job runs longer than 5 minutes */
max_run_alarm: 5
n_retrys: 2
alarm_if_fail: 1

/* Job that validates input: if it fails, stop the whole box */
insert_job: PRD_TRADING_VALIDATE_INPUT
job_type: CMD
/* box_terminator only takes effect inside a box; PRD_TRADING_EOD_BOX is an assumed name */
box_name: PRD_TRADING_EOD_BOX
command: /scripts/validate_input.sh
machine: etl-server-01
box_terminator: 1
alarm_if_fail: 1

/* Notification job that triggers on failure of the critical predecessor */
insert_job: PRD_EOD_FAIL_NOTIFY
job_type: CMD
command: /scripts/send_pager.sh "EOD batch failed at step: PRD_TRADING_FETCH_RATES"
machine: notify-server-01
condition: failure(PRD_TRADING_FETCH_RATES)
alarm_if_fail: 1
Don't rely solely on alarm_if_fail
If your alarm system uses AutoSys's built-in alerting, make sure it's actually configured to send to your monitoring tool. Many teams discover too late that alarm_if_fail only raises an alarm inside AutoSys; it doesn't email or page anyone unless alarm notification has been configured.
Production Insight
A trading firm lost $50k because a job retried 3 times (n_retrys: 3), each time after 60 seconds, delaying failure detection by 3 minutes.
They changed to n_retrys: 1 with alarm on final failure.
Rule: n_retrys is for transient blips, not permanent failures — don't delay alerting trying to retry through a broken state.
Key Takeaway
Design your error chain like a circuit breaker — retry for transient, alarm for permanent, page for critical.
Never let n_retrys mask a real production issue.
Use condition: failure(job_name) to trigger notification jobs for escalation.
Choosing retry vs immediate alarm
If: Job calls an external API (transient failures) → Use: n_retrys: 2 with a short interval. Monitor the success rate; if more than 5% still fail after retries, fix the API.
If: Job validates input files (permanent if missing) → Use: No retries. Set alarm_if_fail: 1 and box_terminator: 1.
If: Job is a data load with an idempotent script → Use: More aggressive retries (n_retrys: 3), because replay is safe.
Production incident · POST-MORTEM · severity: high
Missing Input File Takes Down Entire EOD Batch
Symptom
EOD batch started at 21:00. At 21:45, the first transform job failed with a missing-file error. The box continued running other jobs until all were blocked. Investigation took 30 minutes. Recovery required restarting the entire batch after the file arrived at 01:30.
Assumption
The team assumed the file would always arrive before the batch started because it had for the past year. There was no pre-check to validate its presence.
Root cause
No pre-check job was defined to verify input file existence before processing. Missing box_terminator on validation meant the box continued despite the missing dependency, wasting compute and masking the issue.
Fix
Added a pre-check job at the start of the master box that checks for all required input files. Set box_terminator: 1 so the entire EOD batch stops immediately if any file is missing. Added alerts to the operations team.
Key lesson
Always validate external dependencies before starting batch processing
Use box_terminator on pre-check jobs to stop wasted work early
Monitor file arrivals separately from batch execution
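For the last point, a File Watcher job is one option. The sketch below assumes a hypothetical inbound path, machine, and schedule; the watcher completes when the file is present and stable, and max_run_alarm raises an alarm if it is still waiting after the given number of minutes.

```jil
/* Hypothetical File Watcher that tracks the inbound file separately from the batch */
insert_job: PRD_TRADING_INPUT_WATCH
job_type: FW
machine: etl-server-01
owner: batchuser
watch_file: /data/inbound/trades.csv
/* Seconds between checks of the file */
watch_interval: 60
date_conditions: 1
days_of_week: mo,tu,we,th,fr
start_times: "20:00"
/* Minutes: raise an alarm if the file has not arrived 45 minutes after the watcher starts */
max_run_alarm: 45
alarm_if_fail: 1
```

Operations then hear about a late file before the batch start time, instead of discovering it from a failed transform job.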
Production debug guide
Common job failure scenarios and the exact commands to diagnose and fix them
Symptom · 01
Job shows SUCCESS but expected output is missing
→
Fix
Check the job's std_out_file and std_err_file. Use autorep -J job_name -q to confirm the command in the definition is the one you expect, and autorep -J job_name -d to see the detailed run report, including the exit code.
Symptom · 02
Job stuck in RUNNING state for hours
→
Fix
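Check that the machine is reachable with ping machine, then test the agent with autoping -m machine. Check the scheduler and agent logs for communication issues. If the job is truly hung, stop it with sendevent -E KILLJOB -J job_name and, with caution, restart it with sendevent -E FORCE_STARTJOB -J job_name.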
Symptom · 03
Box job never starts even though conditions appear met
→
Fix
Verify the box start_times and date_conditions with autorep -J box_name -q. Use autorep -J box_name to see the current status, and job_depends -c -J box_name to list the box's conditions and whether each one is currently satisfied.
Symptom · 04
Job fails with n_retrys exhausted but you want it to keep running
→
Fix
Increase n_retrys or implement retry logic inside the script itself (e.g., a loop with sleep). After a manual fix, use sendevent -E CHANGE_STATUS -s SUCCESS -J job_name to force-mark the job as successful.
AutoSys Quick Debug Cheat Sheet
Fast commands to diagnose and fix common AutoSys job failures without digging through docs.
Check the scripting log in the directory specified in std_out_file/std_err_file.
Box job not starting: need to see conditions
Immediate action
Show the box definition and the state of its conditions
Commands
autorep -J box_name -q
job_depends -c -J box_name
Fix now
If the condition depends on a failed job, restart that job first: sendevent -E FORCE_STARTJOB -J failed_job. If it's a time condition, verify start_times and days_of_week.
Job in SUCCESS but shouldn't have run yet
Immediate action
Check job history for recent changes
Commands
autorep -J job_name -r 5
grep job_name /var/log/autosys/*.log | tail -20
Fix now
Look for sendevent commands or calendar overrides that might have triggered the job early. Check for global variable changes.
sendevent command not taking effect
Immediate action
Verify the user has permissions and that the Event Server and scheduler are reachable
Commands
chk_auto_up
autosyslog -e
Fix now
Confirm which binary is being run by calling sendevent by its full path under $AUTOSYS/bin. If chk_auto_up reports the scheduler or Event Server as down, have it restarted before resending the event.
Pattern summary
Pattern | Benefit | When to apply
3-level box hierarchy | Visibility at multiple levels, clean partial recovery | All complex EOD/batch workflows
Pre/post check jobs | Catch environmental issues early, validate output | Any workflow with external dependencies
box_terminator on validation | Stop the whole box on critical pre-condition failure | Input validation, pre-requisite checks
n_retrys: 1 or 2 on I/O jobs | Handle transient network/DB blips automatically | Jobs calling external services or DBs
Environment prefix in names | Prevent cross-environment accidents | All environments, always
Parallel section boxes with join | Reduce batch window by running independent work concurrently | Independent reports, parallel batch streams
Error handling chains | Layer retries, alarms, and notifications for reliable recovery | Any critical path in the batch
Key takeaways
1. Use a consistent naming convention that includes environment, system, function, and frequency
2. The 3-level hierarchy (master box → section boxes → job chains) is the standard pattern for complex batch
3. Pre-check jobs with box_terminator stop wasted time on doomed runs; post-check jobs validate success
4. Version-control your JIL scripts: every change tracked, every rollback possible
5. Parallel execution can cut batch windows but limit fan-out to under 15 branches
6. Design error handling chains: retry transients, alarm for permanents, page for criticals
Common mistakes to avoid
×
Building flat job lists with hundreds of conditions instead of using box hierarchy
Symptom
Maintenance nightmare: changing one dependency requires updating dozens of conditions. A single failure in the middle of the list can cascade incorrectly.
Fix
Wrap logical groups in boxes. Use conditions only between boxes, not between individual jobs across groups. The 3-level hierarchy should be your default.
×
Skipping pre-check jobs to save time
Symptom
A 2-hour batch run fails at step 50 because the disk was full, wasting up to 2 hours of processing. The batch cannot be resumed; it must be restarted from scratch.
Fix
Always add a pre-check job at the start of the master box that validates all prerequisites. Set box_terminator: 1 so the batch stops immediately if anything is wrong.
×
Inconsistent naming that makes searching impossible
Symptom
Engineers spend 30+ minutes trying to find the right job. Two jobs with similar names cause confusion — one in production, one in test. Manual documentation is the only way to understand job purpose.
Fix
Establish a naming convention before the environment grows. Enforce it with a script that rejects JIL not matching the pattern. Include environment, system, function, and frequency.
×
Not version-controlling JIL scripts
Symptom
Someone changes a job and it breaks. No one knows what changed, when, or why. Rolling back requires manually reconstructing the previous definition from memory.
Fix
Store all JIL definitions in Git. Use pull requests for changes. Run git log on a job to see its entire history. Every rollback is a simple revert.
×
Too many parallel branches overwhelming the Event Server
Symptom
Jobs go into PENDING state but never start, or condition evaluation slows to a crawl. The Event Server CPU spikes and existing jobs take longer to complete.
Fix
Keep parallel fan-out under 15 independent branches. If you need more concurrency, implement sub-scheduling or stagger start times. Monitor Event Server CPU usage during batch windows.
INTERVIEW PREP · PRACTICE MODE
Interview Questions on This Topic
Q01 of 05 · SENIOR
What naming convention would you use for AutoSys jobs?
ANSWER
I'd use a four-part pattern: ENVIRONMENT_SYSTEM_FUNCTION_FREQUENCY, for example PRD_TRADING_EXTRACT_DAILY. The environment prefix (PRD/QAT/DEV) prevents cross-environment mistakes. The system identifier allows filtering jobs by system. The function describes what the job does (extract, load, report). The frequency distinguishes periodic jobs. Box jobs get a _BOX suffix. This convention makes jobs self-documenting and grep-friendly.
Q02 of 05 · SENIOR
Describe the 3-level box hierarchy pattern for EOD batch orchestration.
ANSWER
The pattern has three levels: a master box (the top-level coordinator with time conditions), section boxes inside the master (logical groupings like EXTRACT, TRANSFORM, REPORT, each with success conditions on the previous section box), and actual CMD jobs inside each section box. This gives visibility at multiple levels and allows partial recovery: if the TRANSFORM box fails, you can fix and rerun only that section without restarting the entire EOD.
Q03 of 05 · SENIOR
Why would you use a pre-check job with box_terminator?
ANSWER
A pre-check job validates that all prerequisites are met before any processing starts: disk space, database connectivity, input files, dependent systems available. Setting box_terminator: 1 means if the pre-check fails, the entire box (and all jobs inside it) immediately stops. This prevents wasting hours of compute time on a run that is guaranteed to fail later. It also surfaces the root cause early instead of hiding it under a cascade of downstream errors.
Q04 of 05 · SENIOR
How do you make an AutoSys environment version-controlled?
ANSWER
Store every JIL definition as a file in a Git repository, one file per job or per logical box. Use a CI pipeline that validates JIL syntax and enforces naming conventions before merging. When a change is approved, the pipeline exports the current definition from the target environment with autorep -J job_name -q, compares it with the new version to generate an update script, and applies that script with the jil command. Many teams also store environment-specific global variables in separate files. Git blame becomes a powerful tool to answer 'who changed this job and why?'
Q05 of 05 · SENIOR
What's the difference between a well-designed AutoSys environment and a poorly-designed one?
ANSWER
A well-designed environment has: consistent naming conventions that make jobs immediately identifiable; a 3-level box hierarchy that isolates failures to specific sections; pre-check jobs that catch environmental issues early; post-check jobs that validate output; error handling chains that differentiate transient from permanent failures; and version-controlled JIL scripts. A poorly-designed environment has flat lists of jobs with dozens of conditions, naming like 'job1' and 'extract_v2', no pre-checks, and JIL changes that are made directly in production without review. The well-designed one allows a new engineer to find and fix a job in minutes; the poorly-designed one requires tribal knowledge and hours of digging.
FAQ · 7 QUESTIONS
Frequently Asked Questions
01
What naming convention should I use for AutoSys jobs?
A common and effective pattern is ENVIRONMENT_SYSTEM_FUNCTION_FREQUENCY — for example, PRD_TRADING_EXTRACT_DAILY. This makes jobs self-documenting and searchable. Always prefix with the environment (PRD/QAT/DEV) to prevent accidental cross-environment mistakes.
02
What is the 3-level box hierarchy pattern in AutoSys?
The 3-level pattern is: a master box that controls the overall run schedule, section boxes (grouped by logical function like EXTRACT, TRANSFORM, REPORT), and CMD jobs inside each section box. This gives you visibility at multiple levels and clean partial recovery.
03
Should I version control my JIL scripts?
Yes, absolutely. Store all JIL definitions in Git (or your corporate SCM). Every change is tracked with who made it and why. When a schedule change breaks something, git log tells you exactly what changed. Many teams require a peer review on JIL changes before they're applied to production.
04
What is a pre-check job in AutoSys?
A pre-check job runs at the start of a box, before any real processing, and validates that all preconditions are met: sufficient disk space, database connectivity, input files present, dependent systems available. It's marked as box_terminator: 1 so a failed pre-check immediately stops the entire box rather than wasting hours of processing on a doomed run.
05
How many jobs is too many for one AutoSys box?
There's no hard limit, but more than 20-30 jobs in a single box starts to become hard to manage visually and operationally. When a box grows large, refactor it into a parent box with child section boxes. The 3-level hierarchy scales to hundreds of jobs while remaining manageable.
06
Can I run jobs in parallel in AutoSys?
Yes, by default jobs inside a box run in parallel unless you add conditions to serialize them. To control parallelism intentionally, create multiple section boxes with no cross-dependencies. Use a join box with a compound condition (condition: success(BoxA) & success(BoxB)) to synchronize after parallel execution.
07
How do I handle temporary failures without alerting operations?
Use n_retrys on the job definition. For example, n_retrys: 2 will automatically retry the job up to two times before reporting failure. Note that max_run_alarm does not set a retry interval; it raises an alarm when a run exceeds the given number of minutes, so use it separately to catch hung jobs. Combine n_retrys with alarm_if_fail to alert only if all retries are exhausted.