AutoSys Real-World Patterns and Best Practices
- Use a consistent naming convention that includes environment, system, function, and frequency
- The 3-level hierarchy (master box → section boxes → job chains) is the standard pattern for complex batch
- Pre-check jobs with box_terminator stop wasted time on doomed runs; post-check jobs validate success
- Parallel execution uses section boxes with no conditions between them, joined by a box whose condition requires success of every parallel box
- Error handling chains combine alarm_if_fail, n_retrys, and notification to catch failures before they escalate
AutoSys Quick Debug Cheat Sheet
Job failed – need exit code and last run time
    autorep -J job_name -j
    autorep -J job_name -q | grep -E 'last_start|exit_code'
Box job not starting – need to see conditions
    autorep -J box_name -q -w
    autorep -J box_name -q -c
Job in SUCCESS but shouldn't have run yet
    autorep -J job_name -r 5
    grep job_name /var/log/autosys/*.log | tail -20
sendevent command not taking effect
    sendevent -E PING_EVENT
    autosyslog -l | grep -i 'event_not_found'
Production Debug Guide
Common job failure scenarios and the exact commands to diagnose and fix them
- Run autorep -J job_name -q to verify the job's command definition. Look for the exit code in the job history with autorep -J job_name.
- Ping the agent machine (ping -n machine), then check the Event Server logs for agent communication issues. If the job has to be stopped, use sendevent -E KILLJOB -J job_name, then restart it with sendevent -E FORCE_STARTJOB -J job_name. Use both with caution.
- Run autorep -J box_name -q -w to see the box status and pending conditions. Look for unsatisfied conditions with autorep -J box_name -q -c.
- After a manual fix, use sendevent -E CHANGE_STATUS -s SUCCESS -J job_name to force-mark the job as successful.
Working well with AutoSys means understanding not just the syntax but the patterns that experienced architects use to build batch workflows that run reliably for years. These are the practices that separate a well-run AutoSys environment from one where every incident is a fire drill.
Naming conventions — the difference between sane and unmanageable
In a large AutoSys environment with thousands of jobs, naming conventions are everything. A consistent, searchable naming convention means you can find any job in seconds and understand its purpose without documentation.
Recommended pattern: <ENVIRONMENT>_<SYSTEM>_<FUNCTION>_<FREQUENCY>
    /* Good naming examples */
    PRD_TRADING_EXTRACT_DAILY      /* production, trading system, extract, runs daily */
    PRD_TRADING_TRANSFORM_DAILY
    PRD_TRADING_LOAD_DAILY
    PRD_PAYROLL_RUN_WEEKLY         /* production, payroll, weekly */
    PRD_RISK_REPORT_EOD            /* production, risk, report, end-of-day */

    /* Box jobs: suffix with _BOX */
    PRD_TRADING_EOD_BOX
    PRD_PAYROLL_BOX

    /* File Watchers: suffix with _FW or _WATCH */
    PRD_TRADING_SETTLE_FW
    PRD_FEEDS_MARKET_DATA_FW

    /* BAD naming (don't do this) */
    job1
    my_script
    test_final_v2_FINAL
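A convention only helps if it is enforced. As a minimal sketch, assuming the environment prefixes PRD, QAT and DEV and at least two further uppercase segments, a small shell check could reject names like job1 before a JIL file is loaded. The function name is_valid_job_name is hypothetical and not part of AutoSys.

```shell
#!/bin/sh
# Hypothetical helper, not an AutoSys command.
# Accepts names such as PRD_TRADING_EXTRACT_DAILY or PRD_PAYROLL_BOX:
# an environment prefix (assumed: PRD, QAT, DEV) followed by two or more
# uppercase alphanumeric segments separated by underscores.
is_valid_job_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^(PRD|QAT|DEV)(_[A-Z0-9]+){2,}$'
}

# Example: report each candidate name as accepted or rejected.
for name in PRD_TRADING_EXTRACT_DAILY job1; do
  if is_valid_job_name "$name"; then
    echo "ok   $name"
  else
    echo "BAD  $name"
  fi
done
```

The pattern is deliberately loose about segment count so that box and file-watcher names with suffixes still pass. Tighten it if your convention fixes the number of segments.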
When you run autorep -J PRD_%, you know you're looking at production. This simple prefix prevents incidents.
A <TEAM>_<SYSTEM>_... prefix helps avoid name collisions.
The standard EOD orchestration pattern
The standard pattern for end-of-day batch is a three-level hierarchy: master box → section boxes → job chains. This gives you visibility at multiple levels and makes partial failure recovery clean.
    /* Level 1: Master box, the overall EOD coordinator */
    insert_job: PRD_EOD_MASTER_BOX
    job_type: BOX
    date_conditions: 1
    /* days_of_week takes two-letter day codes, not ranges */
    days_of_week: mo,tu,we,th,fr
    start_times: "21:00"
    alarm_if_fail: 1

    /* Level 2: Section boxes, the logical groupings */
    insert_job: PRD_EOD_EXTRACT_BOX
    job_type: BOX
    box_name: PRD_EOD_MASTER_BOX

    insert_job: PRD_EOD_TRANSFORM_BOX
    job_type: BOX
    box_name: PRD_EOD_MASTER_BOX
    condition: success(PRD_EOD_EXTRACT_BOX)

    insert_job: PRD_EOD_REPORT_BOX
    job_type: BOX
    box_name: PRD_EOD_MASTER_BOX
    condition: success(PRD_EOD_TRANSFORM_BOX)

    /* Level 3: Actual CMD jobs inside section boxes */
    insert_job: PRD_TRADE_EXTRACT_DAILY
    job_type: CMD
    box_name: PRD_EOD_EXTRACT_BOX
    command: /scripts/extract_trades.sh
    machine: etl-server-01
    owner: batchuser
    alarm_if_fail: 1
    n_retrys: 1
    std_out_file: /logs/autosys/PRD_TRADE_EXTRACT_DAILY.out
    std_err_file: /logs/autosys/PRD_TRADE_EXTRACT_DAILY.err
Always include a pre-check and post-check job
Professional batch workflows include a pre-check job (validates environment/inputs before starting) and a post-check job (validates outputs after completion). These save enormous debugging time.
    /* Pre-check: validates disk space, DB connectivity, input files */
    /* To gate the run, the first section box (PRD_EOD_EXTRACT_BOX) should also carry condition: success(PRD_EOD_PRE_CHECK) */
    insert_job: PRD_EOD_PRE_CHECK
    job_type: CMD
    box_name: PRD_EOD_MASTER_BOX
    command: /scripts/eod_pre_check.sh
    machine: etl-server-01
    owner: batchuser
    /* if the pre-check fails, terminate the entire box */
    box_terminator: 1
    alarm_if_fail: 1

    /* Post-check: validates output record counts, checksums, file presence */
    insert_job: PRD_EOD_POST_CHECK
    job_type: CMD
    box_name: PRD_EOD_MASTER_BOX
    command: /scripts/eod_post_check.sh
    machine: etl-server-01
    owner: batchuser
    condition: success(PRD_EOD_REPORT_BOX)
    alarm_if_fail: 1
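The article names /scripts/eod_pre_check.sh but does not show it. What follows is a rough sketch of what such a script might contain, not the author's version: the function names, the threshold and the paths in the usage comment are all assumptions. The one thing that matters to AutoSys is the exit status, because a non-zero exit is what makes box_terminator fire.

```shell
#!/bin/sh
# Sketch of a pre-check script. All names, paths and thresholds are illustrative.

# Succeeds if the filesystem holding directory $1 has at least $2 KB available.
has_free_space() {
  avail_kb=$(df -Pk "$1" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge "$2" ]
}

# Succeeds if every file passed as an argument exists and is non-empty.
inputs_present() {
  for f in "$@"; do
    [ -s "$f" ] || { echo "missing or empty input: $f" >&2; return 1; }
  done
}

# pre_check DIR MIN_KB FILE...: runs all checks and returns non-zero on the first failure.
pre_check() {
  data_dir="$1"; min_kb="$2"; shift 2
  has_free_space "$data_dir" "$min_kb" || { echo "low disk space on $data_dir" >&2; return 1; }
  inputs_present "$@"
}

# Possible use at the end of the real script (hypothetical paths):
#   pre_check /data 1048576 /data/in/trades.csv || exit 1
```

A database connectivity check would slot in the same way: one more function that returns non-zero on failure, called from pre_check.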
Parallel execution pattern – running independent tasks concurrently
AutoSys can run jobs in parallel inside a box by default. But you need to be intentional: use separate section boxes with no dependency for truly parallel work, or use condition statements to fork and join. The key is to avoid overwhelming the Event Server with hundreds of simultaneous conditions.
Pattern: Create a parent box, then inside it, define multiple section boxes that have no inter-dependency. Each section box runs its jobs in parallel. Use a final section box that depends on all parallel boxes (using condition: success(PARALLEL_BOX_1) & success(PARALLEL_BOX_2)) to join the execution.
    /* Master box that orchestrates parallel work */
    insert_job: PRD_EOD_PARALLEL_MASTER
    job_type: BOX
    date_conditions: 1
    /* a date-scheduled job needs days_of_week or a run calendar */
    days_of_week: all
    start_times: "22:00"

    /* Parallel section boxes: no dependency between them */
    insert_job: PRD_REPORT_A_BOX
    job_type: BOX
    box_name: PRD_EOD_PARALLEL_MASTER

    /* The jobs inside PRD_REPORT_B_BOX are defined the same way as those for box A */
    insert_job: PRD_REPORT_B_BOX
    job_type: BOX
    box_name: PRD_EOD_PARALLEL_MASTER

    /* Inside a section box: a short chain; the two section boxes run in parallel with each other */
    insert_job: PRD_REPORT_A_GEN
    job_type: CMD
    box_name: PRD_REPORT_A_BOX
    command: /scripts/gen_report_a.sh
    machine: rep-server-01
    alarm_if_fail: 1

    insert_job: PRD_REPORT_A_EMAIL
    job_type: CMD
    box_name: PRD_REPORT_A_BOX
    command: /scripts/email_report_a.sh
    /* machine is required for CMD jobs; assumed to match the generator job */
    machine: rep-server-01
    condition: success(PRD_REPORT_A_GEN)
    alarm_if_fail: 1

    /* Join box that runs after both parallel sections complete */
    insert_job: PRD_EOD_JOIN_BOX
    job_type: BOX
    box_name: PRD_EOD_PARALLEL_MASTER
    condition: success(PRD_REPORT_A_BOX) & success(PRD_REPORT_B_BOX)

    insert_job: PRD_EOD_FINALIZE
    job_type: CMD
    box_name: PRD_EOD_JOIN_BOX
    command: /scripts/eod_finalize.sh
    machine: etl-server-01
    alarm_if_fail: 1
- Boxes with no condition on each other execute in parallel
- Use & (AND) condition on a join box to wait for all parallel streams
- Avoid putting hundreds of jobs in one flat box — they'll still be parallel but become unmanageable
- Alarm on failures inside parallel boxes individually, not at the join box
Error handling chains — catching failures before they cascade
A well-designed AutoSys environment uses a layered error handling chain: immediate retry (n_retrys), job-level alarm (alarm_if_fail), box-level escalation, and finally notification to operations. Don't just set 'alarm_if_fail: 1' and hope. Design the chain so that transient failures auto-recover, permanent failures trigger alerts, and critical failures page a human.
Pattern: For I/O jobs on external systems, set n_retrys: 2 with a short interval. For validation jobs, set alarm_if_fail: 1 and make them box_terminator. For business-critical workflows, add a notification job that runs condition: failure(job_name).
    /* Job that calls an external HTTP API: transient failures are common */
    insert_job: PRD_TRADING_FETCH_RATES
    job_type: CMD
    command: /scripts/fetch_exchange_rates.sh
    machine: api-server-01
    owner: batchuser
    /* max_run_alarm is in minutes: alarm if the job runs longer than 5 minutes */
    max_run_alarm: 5
    n_retrys: 2
    alarm_if_fail: 1

    /* Job that validates input: if it fails, stop the whole box */
    insert_job: PRD_TRADING_VALIDATE_INPUT
    job_type: CMD
    /* box_terminator only has an effect inside a box; this box name is illustrative */
    box_name: PRD_TRADING_EOD_BOX
    command: /scripts/validate_input.sh
    machine: etl-server-01
    box_terminator: 1
    alarm_if_fail: 1

    /* Notification job that triggers on failure of a critical predecessor */
    insert_job: PRD_EOD_FAIL_NOTIFY
    job_type: CMD
    command: /scripts/send_pager.sh "EOD batch failed at step: PRD_TRADING_FETCH_RATES"
    machine: notify-server-01
    condition: failure(PRD_TRADING_FETCH_RATES)
    alarm_if_fail: 1
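Retries, alarms and failure() conditions all depend on the job script reporting failure through its exit status (by default AutoSys treats exit code 0 as success). A multi-step script that swallows an error defeats the whole chain. The sketch below shows one way a job script could stop at the first failing step and pass that step's exit code through. run_step is a hypothetical helper name, and the commands in the usage comment are placeholders.

```shell
#!/bin/sh
# run_step NAME CMD [ARGS...]
# Runs the command, logs a line to stderr if it fails, and returns the
# command's own exit status unchanged so the scheduler sees the real code.
run_step() {
  step="$1"; shift
  "$@"
  step_rc=$?
  if [ "$step_rc" -ne 0 ]; then
    echo "step '$step' failed with exit code $step_rc" >&2
  fi
  return "$step_rc"
}

# Possible use inside a job script (placeholder commands):
#   run_step download ./download_rates.sh || exit $?
#   run_step load     ./load_rates.sh     || exit $?
```

Because stderr is what std_err_file captures, the failing step's name ends up in the job's error log without any extra plumbing.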
| Pattern | Benefit | When to apply |
|---|---|---|
| 3-level box hierarchy | Visibility at multiple levels, clean partial recovery | All complex EOD/batch workflows |
| Pre/post check jobs | Catch environmental issues early, validate output | Any workflow with external dependencies |
| box_terminator on validation | Stop the whole box on critical pre-condition failure | Input validation, pre-requisite checks |
| n_retrys: 1 or 2 on I/O jobs | Handle transient network/DB blips automatically | Jobs calling external services or DBs |
| Environment prefix in names | Prevent cross-environment accidents | All environments, always |
| Parallel section boxes with join | Reduce batch window by running independent work concurrently | Independent reports, parallel batch streams |
| Error handling chains | Layer retries, alarms, and notifications for reliable recovery | Any critical path in the batch |
🎯 Key Takeaways
- Use a consistent naming convention that includes environment, system, function, and frequency
- The 3-level hierarchy (master box → section boxes → job chains) is the standard pattern for complex batch
- Pre-check jobs with box_terminator stop wasted time on doomed runs; post-check jobs validate success
- Version-control your JIL scripts — every change tracked, every rollback possible
- Parallel execution can cut batch windows but limit fan-out to under 15 branches
- Design error handling chains: retry transients, alarm for permanents, page for criticals
Interview Questions on This Topic
- What naming convention would you use for AutoSys jobs? (Mid-level)
- Describe the 3-level box hierarchy pattern for EOD batch orchestration. (Senior)
- Why would you use a pre-check job with box_terminator? (Mid-level)
- How do you make an AutoSys environment version-controlled? (Senior)
- What's the difference between a well-designed AutoSys environment and a poorly-designed one? (Senior)
Frequently Asked Questions
What naming convention should I use for AutoSys jobs?
A common and effective pattern is ENVIRONMENT_SYSTEM_FUNCTION_FREQUENCY — for example, PRD_TRADING_EXTRACT_DAILY. This makes jobs self-documenting and searchable. Always prefix with the environment (PRD/QAT/DEV) to prevent accidental cross-environment mistakes.
What is the 3-level box hierarchy pattern in AutoSys?
The 3-level pattern is: a master box that controls the overall run schedule, section boxes (grouped by logical function like EXTRACT, TRANSFORM, REPORT), and CMD jobs inside each section box. This gives you visibility at multiple levels and clean partial recovery.
Should I version control my JIL scripts?
Yes, absolutely. Store all JIL definitions in Git (or your corporate SCM). Every change is tracked with who made it and why. When a schedule change breaks something, git log tells you exactly what changed. Many teams require a peer review on JIL changes before they're applied to production.
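One possible shape for that workflow, sketched under assumptions: definitions are exported with autorep -J <pattern> -q, committed to Git, and loaded back with jil < file after review. The export_jil function and the AUTOREP_BIN override are hypothetical conveniences (the override only exists so the function can be dry-run on a machine without AutoSys), and the file paths are placeholders.

```shell
#!/bin/sh
# export_jil PATTERN OUTFILE
# Writes the JIL definitions of all jobs matching PATTERN to OUTFILE.
# AUTOREP_BIN is a hypothetical override for dry runs; it defaults to autorep.
export_jil() {
  "${AUTOREP_BIN:-autorep}" -J "$1" -q > "$2"
}

# Possible use (placeholder paths and commit message):
#   export_jil 'PRD_EOD%' jil/prd_eod.jil
#   git add jil/prd_eod.jil
#   git commit -m "Snapshot PRD EOD job definitions"
# After peer review, the reviewed file is loaded with:
#   jil < jil/prd_eod.jil
```

Keeping one file per master box keeps diffs small and makes it obvious which workflow a change touches.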
What is a pre-check job in AutoSys?
A pre-check job runs at the start of a box, before any real processing, and validates that all preconditions are met: sufficient disk space, database connectivity, input files present, dependent systems available. It's marked as box_terminator: 1 so a failed pre-check immediately stops the entire box rather than wasting hours of processing on a doomed run.
How many jobs is too many for one AutoSys box?
There's no hard limit, but more than 20-30 jobs in a single box starts to become hard to manage visually and operationally. When a box grows large, refactor it into a parent box with child section boxes. The 3-level hierarchy scales to hundreds of jobs while remaining manageable.
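To spot boxes that have outgrown that range, you can count box_name references in an exported JIL file. This is a rough sketch, not an AutoSys feature: jobs_per_box and oversized_boxes are hypothetical helpers, the parsing assumes each box_name: attribute is followed by the box name as the next whitespace-separated token, and it does not skip JIL comments.

```shell
#!/bin/sh
# jobs_per_box FILE
# Prints "<box_name> <count>" for every box referenced by a box_name attribute.
jobs_per_box() {
  awk '{
         for (i = 1; i < NF; i++)
           if ($i == "box_name:") count[$(i + 1)]++
       }
       END { for (b in count) print b, count[b] }' "$1" | sort
}

# oversized_boxes FILE MAX
# Prints the names of boxes that directly contain more than MAX jobs.
oversized_boxes() {
  jobs_per_box "$1" | awk -v max="$2" '$2 > max { print $1 }'
}

# Possible use (placeholder file name):
#   oversized_boxes jil/prd_eod.jil 30
```

Only direct children are counted, which matches the advice: a master box with three section boxes counts as three, however many jobs sit underneath.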
Can I run jobs in parallel in AutoSys?
Yes, by default jobs inside a box run in parallel unless you add conditions to serialize them. To control parallelism intentionally, create multiple section boxes with no cross-dependencies. Use a join box with a compound condition (condition: success(BoxA) & success(BoxB)) to synchronize after parallel execution.
How do I handle temporary failures without alerting operations?
Use n_retrys on the job definition. For example, n_retrys: 2 automatically restarts the job up to two times before it is reported as failed. Note that max_run_alarm is a separate control: it raises an alarm when a run exceeds the given number of minutes and does not set a retry interval. Combine n_retrys with alarm_if_fail so operations are alerted only once all retries are exhausted.