Senior 7 min · March 19, 2026

AutoSys - max_run_alarm Prevents Hung Job Pipeline Failure

Without max_run_alarm, a hung AutoSys job at 3:17 AM keeps every downstream job waiting and nobody is told. Here is the fix that GFG tutorials omit.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Written from production experience, not tutorials.

Production tested · Last updated: May 23, 2026
Quick Answer
  • Enterprise workload automation for scheduling, dependency, and monitoring
  • Jobs defined via JIL (Job Information Language) — scripts, executables, DB calls
  • Centralised control across hundreds of servers from a single dashboard
  • Job chains: Job B runs only after Job A succeeds; retry and alert logic built-in
  • max_run_alarm prevents hung jobs from blocking downstream for hours
  • Biggest mistake: treating it like cron — AutoSys has its own lifecycle and state machine
Plain-English First

AutoSys is basically a smart alarm clock for your servers — except instead of waking you up, it runs programs, scripts, and batch jobs at exactly the right time, in the right order, and tells you when something went wrong.

If you've ever worked in an enterprise IT environment (banking, insurance, telecom, retail), you've probably heard someone say 'the AutoSys job failed at 2am.' AutoSys is the tool that runs the world's batch processing. It has been doing this since the 1990s, passing through Platinum Technology and CA Technologies before Broadcom took it over, and it's still running mission-critical ETL pipelines, payroll runs, and report generation at thousands of companies today.

The reason AutoSys stuck around isn't nostalgia. It's because it solves a real problem that simple cron jobs can't: running complex workflows where Job B depends on Job A, Job A might fail and need a retry, and you need a centralised dashboard to see what's happening across 200 servers at once.

What is AutoSys and what does it actually do

AutoSys is a workload automation platform. At its core it does three things: scheduling (run this job at 3am every weekday), dependency management (run this job only after that job succeeds), and monitoring (alert me if anything takes longer than expected or fails).

A 'job' in AutoSys can be any executable — a shell script, a Python script, a Java program, a database procedure call, or even just a system command. AutoSys doesn't care what the job does; it just controls when it runs and what happens next.

simple_jil_example.jil (JIL)
/* A basic AutoSys job definition */
insert_job: daily_report
job_type: CMD
command: /opt/scripts/generate_report.sh
machine: prod-server-01
owner: svcaccount
/* date_conditions: 1 is required, otherwise days_of_week and start_times are ignored */
date_conditions: 1
days_of_week: mo,tu,we,th,fr
start_times: "06:00"
description: "Generates daily sales report"
Output
/* Job inserted successfully into AutoSys database */
AutoSys is now owned by Broadcom
AutoSys was a CA Technologies product for most of its life (CA inherited it with its 1999 acquisition of Platinum Technology). Broadcom acquired CA in 2018. The product is now officially called Broadcom AutoSys Workload Automation, but most teams still just call it AutoSys.
Production Insight
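If you define a command job without machine, jil rejects the insert with an error instead of creating the job. owner is optional and defaults to the user who ran jil, which is rarely the service account you wanted.
Always validate JIL syntax with the jil command before deploying to production.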
Rule: treat JIL like code — put it under version control.
Key Takeaway
AutoSys is a scheduler, executor, and monitor rolled into one.
JIL is the language you use to tell it what to run and when.
The simplest job needs: name, type, command, machine, and time.
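Because JIL is plain text, it is easy to generate and lint before it ever reaches the jil command. The sketch below is an illustration in Python, not an AutoSys tool: render_jil and REQUIRED_FOR_CMD are hypothetical names, and the checks cover only the handful of attributes named above plus the date_conditions switch that start_times depends on.

```python
"""Minimal sketch: render a CMD job as JIL text and catch obvious gaps.

render_jil and REQUIRED_FOR_CMD are hypothetical helpers, not part of AutoSys.
"""

REQUIRED_FOR_CMD = ("command", "machine")


def render_jil(name: str, attrs: dict) -> str:
    """Return JIL text for a command job, or raise ValueError on missing attributes."""
    missing = [key for key in REQUIRED_FOR_CMD if key not in attrs]
    if missing:
        raise ValueError(f"{name}: missing required attribute(s): {', '.join(missing)}")
    # start_times is only honoured when date_conditions is switched on.
    if "start_times" in attrs and str(attrs.get("date_conditions")) != "1":
        raise ValueError(f"{name}: start_times needs date_conditions: 1")
    lines = [f"insert_job: {name}", "job_type: CMD"]
    lines += [f"{key}: {value}" for key, value in attrs.items()]
    return "\n".join(lines) + "\n"


daily_report = render_jil(
    "daily_report",
    {
        "command": "/opt/scripts/generate_report.sh",
        "machine": "prod-server-01",
        "owner": "svcaccount",
        "date_conditions": 1,
        "days_of_week": "mo,tu,we,th,fr",
        "start_times": '"06:00"',
    },
)
print(daily_report)
```

Keeping the job attributes in a dictionary also makes the definitions easy to review in version control, which fits the rule about treating JIL like code.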
Figure: AutoSys Workflow, From Job Definition to Execution. Define job in JIL (script, schedule, machine, conditions) → Event Server stores it (persistent database of all definitions) → Event Processor evaluates (checks conditions every cycle) → Remote Agent executes (runs command on target machine) → Status reported back (SUCCESS / FAILURE / TERMINATED).

Why enterprises use AutoSys instead of cron

Cron is great for simple, single-server scheduling. But AutoSys was built for a different scale. When you have hundreds of interdependent jobs running across dozens of servers, cron's limitations become painful fast.

AutoSys gives you: centralised control across all servers from one place, job dependency chains (job C only runs if job A and B both succeeded), a GUI to visualise job flows, automatic retry logic, alerting when jobs take too long or fail, audit trails for compliance, and the ability to put jobs on hold or ice without deleting them. Banks running end-of-day settlement processes can't afford to manage 500 cron entries across 30 servers manually.

Production Insight
Using cron across servers means you lose centralised visibility — a hung job on one server can block everything but you won't see it until the next report fails.
AutoSys's event server captures every state change; you can replay an entire night's history in seconds.
Rule: if your batch pipeline spans more than 3 servers, it's time to drop cron.
Key Takeaway
AutoSys centralises scheduling, dependency, and monitoring across machines.
Cron is per-server, per-script — it breaks at enterprise scale.
The killer feature: dependency conditions with status checks (success, failure, completion).

Who uses AutoSys in the real world

AutoSys is heavily used in industries that run large batch workloads on tight schedules: banking and financial services (end-of-day processing, regulatory reporting), insurance (claims processing, premium calculations), telecoms (billing runs, CDR processing), retail (inventory reconciliation, overnight pricing updates), and healthcare (claims adjudication, HL7 batch feeds).

If you're going for a role as a batch developer, ETL developer, production support engineer, or middleware/integration developer at any large enterprise, there's a solid chance AutoSys is in the stack.

Production Insight
In banking, a single failed batch job can delay regulatory filings by a day — that's a compliance fail.
AutoSys jobs often run on dedicated batch servers that don't have production monitoring agents — you need AutoSys-specific alerting.
Rule: always add max_run_alarm and email notification to every production job.
Key Takeaway
AutoSys dominates in industries where batch reliability is business-critical.
If you work in enterprise IT, you'll likely encounter AutoSys.
Learn JIL basics — they transfer across all AutoSys deployments.

AutoSys Job Lifecycle and Key Concepts

An AutoSys job goes through a defined lifecycle: INACTIVE → STARTING → RUNNING → SUCCESS (or FAILURE or TERMINATED). You can also place a job ON ICE (removed from scheduling until someone takes it off ice) or ON HOLD (blocked until someone takes it off hold, at which point it runs if its conditions are already met).

Key concepts
  • status: current state of the job
  • condition: expression that controls when a job starts based on upstream job statuses
  • start_times: wall-clock time triggers
  • max_run_alarm: runtime limit in minutes; an alarm fires when the job runs longer, but the job itself keeps running
  • box: a container job that groups jobs together for scheduling and visibility

Jobs are defined using JIL and stored in the AutoSys Event Server database.

box_jil_example.jil (JIL)
/* Box job containing two child jobs with a dependency */
/* The schedule lives on the box; children only run while the box is RUNNING */
insert_job: end_of_day_box
job_type: BOX
owner: batchuser
date_conditions: 1
days_of_week: all
start_times: "02:00"

/* No condition: starts as soon as the box goes RUNNING */
insert_job: settlement
job_type: CMD
box_name: end_of_day_box
command: /app/scripts/run_settlement.sh
machine: batch-srv-01
owner: batchuser

/* Starts only after settlement reports SUCCESS */
insert_job: reconciliation
job_type: CMD
box_name: end_of_day_box
command: /app/scripts/run_recon.sh
machine: batch-srv-02
owner: batchuser
condition: success(settlement)
Output
/* Box job 'end_of_day_box' inserted with two children */
Box jobs as folders
  • A box's status aggregates child statuses — if any child fails, the box shows FAILURE.
  • You can start/stop a box, and it cascades to all children.
  • Boxes can be nested, allowing hierarchical grouping of complex workflows.
Production Insight
Box jobs are often misused as 'dummy parents' — if you put a condition on the box itself, children may never start.
A box job with no children has nothing to complete, so once started it sits in RUNNING indefinitely.
Rule: children should have their own conditions; box-level conditions are for advanced orchestration only.
Key Takeaway
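ON ICE vs ON HOLD: both need an explicit release. A job taken off hold runs right away if its conditions are already met; a job taken off ice waits for its conditions to occur again. While parked, an ON ICE job counts as satisfied for downstream conditions and an ON HOLD job does not.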
max_run_alarm is your first line of defence against hung jobs — always set it.
Box jobs are organisational — don't overuse them for simple workflows.
When to use a Box job vs individual CMD jobs
If: Jobs are independent but belong to the same logical workflow
Use: Group them in a box for visibility and manual control.
If: Jobs share the same schedule and retry settings
Use: Put them in a box and set box-level attributes (like days_of_week).
If: You need to start a batch of jobs together but not based on any dependency
Use: A box job is perfect: start the box and all children start on their own start_times.
If: Jobs are completely independent and don't need grouping
Use: Individual CMD jobs. No box needed.

Common Job Statuses and What They Mean in Production

AutoSys jobs report one of about 12 statuses. The ones you'll encounter most:

  • INACTIVE (IN): Job exists but has not run yet, or its status was reset. Usually means it's waiting for its schedule or condition.
  • STARTING (ST): Job is being dispatched to the agent machine.
  • RUNNING (RU): Job is executing on the agent. This is where most hangs occur.
  • SUCCESS (SU): Job completed with exit code 0 (or any code up to max_exit_success, if you set it).
  • FAILURE (FA): Job completed with an exit code above that limit, which by default means any non-zero code.
  • TERMINATED (TE): Job was killed while running (by a KILLJOB event or by term_run_time; max_run_alarm alone only raises an alarm).
  • ON ICE (OI): Job is removed from scheduling until it is taken off ice. Downstream conditions that reference it are treated as satisfied.
  • ON HOLD (OH): Job cannot start until it is taken off hold. Downstream jobs that depend on it stay blocked.
  • RESTART (RE): Job could not be started and has been scheduled for another attempt.
  • ACTIVATED (AC): The job's box is RUNNING, but the job itself has not started yet.
  • PEND_MACH (PE): Job is ready to run but its agent machine is offline or unavailable.

Knowing the status tells you exactly where to look next.
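To make "status tells you where to look" concrete, here is a small lookup sketch in Python. It is an illustration only: NEXT_STEP and first_step are hypothetical names, and the hints paraphrase this article instead of quoting any AutoSys output.

```python
"""Minimal sketch: map two-letter AutoSys status codes to a first triage step.

NEXT_STEP and first_step are hypothetical names used for illustration.
"""

NEXT_STEP = {
    "ST": "STARTING: check that the agent machine is reachable and the agent is up",
    "RU": "RUNNING: confirm the process is alive and making progress on the agent",
    "FA": "FAILURE: read the exit code and the job's stderr file",
    "TE": "TERMINATED: find out who or what killed the job",
    "OH": "ON_HOLD: the job is waiting for a JOB_OFF_HOLD event",
    "OI": "ON_ICE: the job is waiting for a JOB_OFF_ICE event",
}


def first_step(status_code: str) -> str:
    """Return the triage hint for a status code (case and padding are ignored)."""
    code = status_code.strip().upper()
    return NEXT_STEP.get(code, f"{code}: no hint recorded, check the job's event history")


print(first_step("ru"))
```

A table like this is most useful when it lives next to the on-call runbook, so the first five minutes of an incident don't depend on who happens to be awake.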

Don't mistake ON HOLD for ON ICE
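Both states need an explicit release (JOB_OFF_HOLD or JOB_OFF_ICE); neither clears by itself. The difference is what happens around them. Take a job off hold and it starts immediately if its conditions were already met while it was parked; take it off ice and it waits for the next time its conditions occur. Downstream, a job waiting on an ON HOLD upstream stays blocked, while a job waiting on an ON ICE upstream treats the condition as satisfied. A common production mistake is releasing a hold and watching the job fire at once, or icing a job and discovering at 3am that everything downstream ran without it.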
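The two behaviours are easier to keep apart as a tiny truth table. The Python below is an illustrative model, not AutoSys code: the function names are hypothetical, and it encodes only the release and downstream rules as I understand them from the product's documented semantics.

```python
"""Minimal sketch of ON_HOLD versus ON_ICE semantics (illustrative model only)."""

RELEASE_EVENT = {"ON_HOLD": "JOB_OFF_HOLD", "ON_ICE": "JOB_OFF_ICE"}


def starts_on_release(state: str, conditions_already_met: bool) -> bool:
    """Does the job start as soon as it is released from the given state?"""
    if state not in RELEASE_EVENT:
        raise ValueError(f"not a parked state: {state}")
    if state == "ON_HOLD":
        # Off hold: runs right away if its starting conditions are already satisfied.
        return conditions_already_met
    # Off ice: waits for the next occurrence of its starting conditions.
    return False


def downstream_condition_satisfied(parked_state: str) -> bool:
    """Is a condition on the parked job treated as satisfied for downstream jobs?"""
    if parked_state not in RELEASE_EVENT:
        raise ValueError(f"not a parked state: {parked_state}")
    # Conditions on an iced job evaluate as true; conditions on a held job stay unmet.
    return parked_state == "ON_ICE"


for state in RELEASE_EVENT:
    print(state, RELEASE_EVENT[state],
          starts_on_release(state, True), downstream_condition_satisfied(state))
```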
Production Insight
A job stuck in STARTING for more than a few minutes usually means the agent machine is unreachable or the AutoSys agent is down.
A job that flips between RUNNING and TERMINATED repeatedly is probably being killed by an external watchdog.
Rule: use autorep -J JOB_NAME -d for the detailed report of the last run, including its events and exit code; add -r -1 to look at the run before it.
Key Takeaway
Status + log = 90% of diagnosis.
SU (SUCCESS) means exit code 0; FA (FAILURE) means non-zero.
ON ICE and ON HOLD behave differently — learn the distinction.

Why cron breaks at scale and WLA doesn't

Cron works fine for a dozen jobs on one box. The moment you cross a hundred jobs spread across data centers, cloud instances, and on-prem mainframes, cron becomes a liability. There's no global dependency graph, no retry logic, no alerting pipeline. A job fails at 3 AM and the next 47 downstream jobs fail silently. That's the gap Workload Automation (WLA) fills.

AutoSys is an enterprise WLA engine. It doesn't just run jobs on a timer — it evaluates dependencies, respects calendars, reroutes on failure, and centralizes logging. Think of it as an event-driven state machine for your batch processing. Every job registers with an agent, the agent reports status to a central event processor, and that processor decides what to spawn next. No polling loops. No SSH cron hacks. Just declarative job definitions that the system turns into execution guarantees.

Enterprises adopt AutoSys because their batch windows shrink while data volumes explode. Cron can't scale horizontally. AutoSys can. You add agents, not rewrite scripts.

ScalingJobDependencies.yml (crontab, then JIL)
# Why cron fails: no dependency resolution (crontab entries)
# Runs daily at 2 AM. If this fails, nothing downstream knows.
0 2 * * * /opt/scripts/ingest.sh
# Runs at 3 AM and assumes the upstream succeeded: a hidden race condition.
0 3 * * * /opt/scripts/transform.sh

/* AutoSys version. The machine name is a placeholder. */
insert_job: nightly_data_pipeline
job_type: BOX
date_conditions: 1
days_of_week: all
start_times: "02:00"

/* No condition: starts as soon as the box is RUNNING */
insert_job: ingest_data
job_type: CMD
box_name: nightly_data_pipeline
command: /opt/scripts/ingest.sh
machine: etl-agent-01

insert_job: transform_data
job_type: CMD
box_name: nightly_data_pipeline
command: /opt/scripts/transform.sh
machine: etl-agent-01
condition: s(ingest_data)

insert_job: load_data
job_type: CMD
box_name: nightly_data_pipeline
command: /opt/scripts/load.sh
machine: etl-agent-01
condition: s(transform_data)
Output
AutoSys: transform_data runs only after ingest_data completes SUCCESS
Cron: transform_data runs at fixed time regardless — silent failure
Production Trap:
Job A runs every night at 2 AM. Job B runs at 3 AM. One Sunday night, A fails silently because of a DB lock. B runs, processes an empty file, and loads garbage into production. Nobody notices until Monday morning reporting shows corrupted aggregates. Cron gives you zero traceability for this. AutoSys gives you a status tree.
Key Takeaway
If you need dependency chains across machines, auto-retry, or incident alerts — cron is a toy. Use WLA.
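The production trap above can be replayed in a few lines. This is a toy model in Python, assuming three jobs in a straight chain; run_pipeline is a hypothetical function, not anything AutoSys exposes, and "outcomes" simply states what each job would return if it were started.

```python
"""Minimal sketch: fixed-time scheduling versus success() dependencies (toy model)."""


def run_pipeline(order, outcomes, conditions, honour_conditions):
    """Return the jobs that actually get started, in order."""
    started, succeeded = [], set()
    for job in order:
        upstream = conditions.get(job, [])
        if honour_conditions and not all(name in succeeded for name in upstream):
            continue  # condition never satisfied, so the job stays idle
        started.append(job)
        if outcomes[job] == "SUCCESS":
            succeeded.add(job)
    return started


order = ["ingest_data", "transform_data", "load_data"]
outcomes = {"ingest_data": "FAILURE", "transform_data": "SUCCESS", "load_data": "SUCCESS"}
conditions = {"transform_data": ["ingest_data"], "load_data": ["transform_data"]}

cron_like = run_pipeline(order, outcomes, conditions, honour_conditions=False)
dependency_aware = run_pipeline(order, outcomes, conditions, honour_conditions=True)
print("cron-like:", cron_like)
print("dependency-aware:", dependency_aware)
```

With the ingest step failing, the cron-like run still starts all three jobs, while the dependency-aware run stops after the first one. That stop is the whole point: garbage never reaches the load step.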

AutoSys's dirty secret — the event processor bottleneck

Everyone talks about AutoSys like it's magic. It's not. The event processor (the central brain) is a single point of failure and a performance bottleneck. Every job status change, every alert trigger, every calendar check hits this process. If it crashes or gets overloaded, your entire batch pipeline goes dark. No jobs start, no status updates flow, and the ops page lights up like a Christmas tree.

Smart teams run redundant event processors in active-passive mode. They also throttle cross-instance dependencies to avoid cascading failures. The real skill is not writing job definitions — it's designing your dependency graph so one slow job doesn't freeze the entire pipeline. Use time conditions as escape hatches. Never chain more than three jobs deep without a checkpoint.

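Also, AutoSys agents can run on anything (Linux, Windows, z/OS), but every start request and every status event still passes through the scheduler. When it is overloaded, that shows up as minutes of lag between an upstream finishing and its downstream starting. Tune the scheduler and its database for your load. Defaults are for demos, not production.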

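EventProcessorRedundancy.yml (illustrative YAML, then JIL)
# Illustrative pseudo-config only: AutoSys does not read this YAML.
# In AutoSys, failover is set up by configuring a shadow scheduler
# (usually alongside a tie-breaker scheduler) for the instance.

primary_event_processor:
  host: autosys-prime-prod-01
  port: 4444
  mode: active
  heartbeat_interval: 15  # seconds
  notification_delay: 30   # seconds before failover

standby_event_processor:
  host: autosys-standby-prod-01
  port: 4444
  mode: passive
  heartbeat_check: 20  # seconds
  auto_failover: true

/* Escape hatch. A JIL condition cannot hold a clock time (t() means terminated), */
/* so the 6 AM fallback is a second, time-triggered job. Machine names are placeholders. */
insert_job: daily_sales_report
job_type: CMD
command: /app/scripts/sales_rollup.sh
machine: report-agent-01
condition: s(reports_ready)

/* force_start_if_idle.sh is a hypothetical wrapper: it sends FORCE_STARTJOB */
/* for daily_sales_report only if the report has not run yet today. */
insert_job: daily_sales_report_fallback
job_type: CMD
command: /app/scripts/force_start_if_idle.sh daily_sales_report
machine: report-agent-01
date_conditions: 1
days_of_week: all
start_times: "06:00"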
Output
If primary event processor goes down:
- Standby detects missed heartbeats within 20s
- Standby becomes active
- All agents reconnect to new processor
- Downtime window: 30-45 seconds
- Without redundancy: outage until manual restart
Senior Shortcut:
Monitor event processor CPU and queue depth in your observability stack. If it sits above 80% for more than 5 minutes, split your job definitions across two event processors. One hot brain != scalable.
Key Takeaway
AutoSys scales horizontally on agents — the event processor is your single thread. Redundancy and monitoring are non-negotiable.
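The "above 80% for more than 5 minutes" rule is a sustained-threshold check, and it is worth getting the reset logic right so one dip clears the clock. A minimal sketch in Python, assuming you already collect (timestamp, CPU percent) samples from your own monitoring; sustained_above is a hypothetical helper.

```python
"""Minimal sketch: detect a metric that stays above a threshold for a whole window."""


def sustained_above(samples, threshold=80.0, window_seconds=300):
    """samples: iterable of (timestamp_seconds, value), oldest first.

    Returns True once the value has stayed above the threshold for longer
    than window_seconds without a single dip.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start > window_seconds:
                return True
        else:
            breach_start = None  # any dip resets the clock
    return False


steady = [(t, 85.0) for t in range(0, 421, 60)]
dipping = [(0, 85.0), (60, 85.0), (120, 70.0), (180, 85.0),
           (240, 85.0), (300, 85.0), (360, 85.0)]
print(sustained_above(steady), sustained_above(dipping))
```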

Scripting: The Thin Line Between Automation and Technical Debt

AutoSys runs jobs. But how those jobs are defined, what they execute, and how they fail is entirely driven by scripts. If you treat AutoSys as a black box that just runs shell scripts, you're setting yourself up for production fires.

Your job scripts need to handle exit codes explicitly. AutoSys doesn't guess — it reads the exit code from your process. A non-zero exit? That job goes to FAILURE unless you've defined an exit code mapping. Most teams forget this, then wonder why restart logic fails.

Scripts should be stateless, idempotent, and log to stdout/stderr with timestamps. AutoSys captures job output into spool files. Use that. If your script writes to random /tmp files and doesn't clean up, you'll fill the disk on the agent machine. I've seen it. Twice.

Wrap critical jobs in retry logic inside the script, not just in AutoSys JIL. AutoSys retry is blunt — it re-runs the whole command. Script-level retry gives you granular control: retry on specific exit codes, with exponential backoff, without re-triggering downstream dependencies.

script_hardening_example.yml (YAML)
# Exit code convention for a job wrapper script.
# Illustrative config read by the wrapper, not by AutoSys:
# AutoSys only ever sees the wrapper's final exit code.

script:
  exit_codes:
    0: SUCCESS
    1: FAILURE       # generic error
    2: RESTART       # wrapper retries the step (use sparingly)
    3: TERMINATE     # wrapper stops and exits non-zero so the chain halts
  job_wrapper:
    timeout: 3600    # seconds
    log_prefix: "[${AUTO_JOB_NAME}][${AUTORUN}]"
    retry:
      count: 3
      backoff: "exponential"
      on_exit_codes: [1, 2]
Output
AutoSys itself only distinguishes success from failure: exit code 0 (or anything up to max_exit_success) -> SUCCESS, any higher code -> FAILURE. Retrying on code 2 and halting on code 3 are the wrapper's job. AutoSys-level retries come from n_retrys in JIL, which reruns the whole job after a failure.
Production Trap: Exit Code 137
When your script is killed by the OOM killer, the exit code is 137 (128+9). AutoSys sees a plain FAILURE and cannot tell it apart from a logic error. If your job is memory-bound, wrap it in a monitor script that recognises 137 and decides for itself whether to retry or to alert, instead of relying on AutoSys to guess.
Key Takeaway
Your AutoSys job is only as reliable as the script it runs. Master exit codes, or master firefighting.
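Script-level retry with exponential backoff, as argued above, might look like the following Python wrapper. It is a sketch under assumptions: run_with_retry and as_exit_status are hypothetical names, the retryable codes (1 and 2) mirror the example config, and the sleep function is injectable so the behaviour can be checked without waiting.

```python
"""Minimal sketch: retry a command on specific exit codes with exponential backoff."""
import subprocess
import sys
import time


def run_with_retry(command, retry_on=(1, 2), attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run command (a list of arguments) and return its final return code.

    Only the listed codes are retried, so AutoSys sees a single SUCCESS or
    FAILURE for the whole job instead of one run per attempt.
    """
    code = 0
    for attempt in range(1, attempts + 1):
        code = subprocess.run(command).returncode
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} attempt {attempt}/{attempts} exit={code}", flush=True)
        if code not in retry_on or attempt == attempts:
            return code
        sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return code


def as_exit_status(returncode: int) -> int:
    """Translate Python's negative 'killed by signal N' into the shell-style 128+N."""
    return returncode if returncode >= 0 else 128 - returncode


# Demo: a child that always exits 1 is tried three times, with no real waiting.
always_fails = [sys.executable, "-c", "import sys; sys.exit(1)"]
print("final:", run_with_retry(always_fails, sleep=lambda seconds: None))
```

In a real wrapper the last line would be something like sys.exit(as_exit_status(run_with_retry(sys.argv[1:]))), so that a child killed with SIGKILL still reaches AutoSys as 137.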

Docker: AutoSys Can Run Containers. Most Teams Do It Wrong.

AutoSys agents can execute Docker containers as jobs. But don't treat it like a magic wand. The why is simple: AutoSys is an orchestrator, not a scheduler for ephemeral processes. If you're running containers, you're offloading environment management to Docker, but AutoSys still owns lifecycle and dependencies.

The how: Your job command becomes docker run with a specific image tag. No latest. Ever. The agent needs access to the Docker socket — that's a security concern. Most enterprises isolate this via dedicated agents or Docker-in-Docker setups.

Critical: AutoSys cannot see inside the container. The job status depends entirely on the exit code of the docker run command. If the container starts but the app inside crashes, the container exits with non-zero, and AutoSys sees failure. In the foreground, docker run relays the container's stdout and stderr, so the job's output files capture them, but anything the app writes to files inside the container is gone once --rm cleans up unless you mount a volume for it.

Use the --rm flag to clean up containers. I've seen agent hosts fill up with dead containers because someone forgot. Also, mount a shared volume for logs so they outlive the container. Otherwise, you're debugging blind.

docker_job_example.yml (JIL)
/* AutoSys job running a Docker container */
/* Never use the 'latest' tag. Pin to a digest or a semantic version. */
/* JIL has no YAML-style folded scalars, so the command is written on one line. */
/* max_run_alarm is in minutes. */

insert_job: nightly_data_pipeline
job_type: CMD
command: docker run --rm --name nightly_pipeline_${AUTORUN} -v /data/pipeline/logs:/app/logs -e JOB_NAME=${AUTO_JOB_NAME} -e RUN_ID=${AUTORUN} registry.internal.io/pipeline-worker:1.2.3
machine: docker-agent-01
timezone: UTC
condition: success(previous_job)
max_run_alarm: 60
alarm_if_fail: 1
std_out_file: /data/pipeline/logs/job_${AUTORUN}.log
std_err_file: /data/pipeline/logs/job_${AUTORUN}.err
Output
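AutoSys executes the 'docker run' command. Container exits with 0 -> job SUCCESS. Container exits non-zero -> job FAILURE. The container's stdout/stderr are written to the job's output files under /data/pipeline/logs, and files the app writes to /app/logs appear in the same host directory through the mounted volume, so logs persist for debugging.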
Senior Shortcut: Use a Wrapper Script
Key Takeaway
Docker adds portability but removes observability. Mount logs, pin image tags, and never forget --rm.
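The "pin the tag, always --rm" rules are easy to enforce in whatever generates the job command. A minimal sketch in Python follows; is_pinned and docker_run_command are hypothetical helpers, and the container name and log paths simply mirror the example above.

```python
"""Minimal sketch: build a docker run command line and refuse unpinned images."""
import shlex


def is_pinned(image: str) -> bool:
    """True if the image reference carries a digest or a tag other than 'latest'."""
    if "@sha256:" in image:
        return True
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry port such as host:5000
    if ":" not in last_segment:
        return False  # no tag at all means Docker falls back to 'latest'
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def docker_run_command(image: str, run_id: str, log_dir: str = "/data/pipeline/logs") -> str:
    """Return a single-line command string suitable for a job's command attribute."""
    if not is_pinned(image):
        raise ValueError(f"image is not pinned to a version or digest: {image}")
    argv = [
        "docker", "run", "--rm",
        "--name", f"nightly_pipeline_{run_id}",
        "-v", f"{log_dir}:/app/logs",
        image,
    ]
    return shlex.join(argv)


print(docker_run_command("registry.internal.io/pipeline-worker:1.2.3", "4711"))
```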

Networking: Why AutoSys jobs fail when your network doesn't

AutoSys relies on network connectivity between the Event Processor, the Remote Agents, and any file servers the jobs touch. When a job runs on a remote machine, the agent has to receive the start request and report status back. If packet loss exceeds about 0.1% or latency spikes above 50ms, jobs can end up TERMINATED or never leave INACTIVE, and no retry logic exists by default. Worse, DNS timeouts on agent startup can cause silent drops: the job never starts but shows no error.

The fix is not to increase socket timeouts. Instead, implement local job wrappers that write logs on the agent first and then copy them to a shared NFS mount, so a network blip never costs you the evidence. Also configure ALARM notifications for agent connectivity, not only for job failures. Most teams ignore this until a network partition kills 2,000 jobs simultaneously.

RemoteAgent_Network_Check.yml (illustrative YAML)
# Illustrative pseudo-config: these keys are not AutoSys settings.
# They describe what a local wrapper and its environment would need.

autosys_profile:
  agent_network_timeout: 30  # seconds
  dns_resolve_retries: 3
  fail_on_dns_failure: false  # job continues with IP cache

job_wrapper:
  local_log_path: /var/tmp/autosys_job_${AUTO_JOB_NAME}.log
  nfs_mount: /mnt/shared/autosys_logs
  post_run: cp ${local_log_path} ${nfs_mount}/
Output
Warning: Setting fail_on_dns_failure=false may hide actual DNS outages.
Production Trap:
AutoSys agents use only one TCP connection to the event processor. If that breaks, all jobs on that agent freeze—no new jobs start, no status updates reach the scheduler. Always run two agents per machine in failover mode.
Key Takeaway
Never trust the network between agent and scheduler; always log locally first.
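"Log locally first" comes down to one ordering rule: the local write must succeed before any network copy is attempted, and a failed copy must never lose the local file. A minimal sketch in Python, with write_then_publish as a hypothetical helper and temporary directories standing in for the local disk and the NFS mount:

```python
"""Minimal sketch: write the job log locally, then try to publish it to shared storage."""
import shutil
import tempfile
from pathlib import Path


def write_then_publish(text, local_path, shared_dir):
    """Write text to local_path, then copy it into shared_dir.

    Returns True if the copy succeeded. A failed copy is reported but never
    removes or truncates the local log.
    """
    local_path = Path(local_path)
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_text(text, encoding="utf-8")
    try:
        shutil.copy2(local_path, Path(shared_dir) / local_path.name)
        return True
    except OSError as exc:
        print(f"shared copy failed, log kept at {local_path}: {exc}")
        return False


with tempfile.TemporaryDirectory() as tmp:
    shared = Path(tmp) / "shared"
    shared.mkdir()
    published = write_then_publish("run finished\n", Path(tmp) / "local" / "job.log", shared)
    print("published:", published)
```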

Kubernetes: AutoSys as a job orchestrator, not a container scheduler

AutoSys can launch Kubernetes batch jobs from a CMD job whose command calls kubectl. But teams make the same mistake: they treat AutoSys as a Kubernetes scheduler. AutoSys should only trigger jobs based on time or events; Kubernetes handles pod placement and retries. The reality is that AutoSys has no native awareness of pod status: it only sees the exit code of the command it launched. If that command returns or is killed before the pod finishes, AutoSys records FAILURE even if the pod later succeeds. Solution: wrap the kubectl command in a polling loop that checks the pod phase every 5 seconds and exits only when the pod reaches Succeeded or Failed. This turns AutoSys into a pure trigger and prevents false negatives. Don't set max_run_alarm to exactly the pod's expected runtime; add a 30% buffer. Also, pin event processors to three replicas with anti-affinity; single-point failures here will orphan every scheduled Kubernetes job.

AutoSys_K8s_Wrapper.yml (JIL)
/* kubectl_poll.sh is the polling wrapper described above. The machine name is a placeholder. */
insert_job: k8s_batch_job
job_type: CMD
command: /usr/local/bin/kubectl_poll.sh --namespace=production --job=my-batch-job --poll-interval=5 --timeout=3600
machine: k8s-client-agent-01
timezone: UTC
date_conditions: 1
days_of_week: all
start_mins: 15
condition: s(previous_job)
/* Minutes: 3600 s expected runtime plus a 30% buffer is 4680 s, or 78 min */
max_run_alarm: 78
Output
Script kubectl_poll.sh calls 'kubectl get pod -l job-name=my-batch-job -o jsonpath="{.items[0].status.phase}"'. Exits 0 only if phase equals 'Succeeded'.
Production Trap:
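max_run_alarm only raises an alarm; it is term_run_time that makes AutoSys kill the wrapper once its limit (in minutes) is reached. Killing the wrapper does not stop the pod: it keeps running while AutoSys shows TERMINATED. Where the kill signal can be trapped, have the wrapper clean up the Kubernetes job on the way out, and always check the pod's real state before rerunning.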
Key Takeaway
AutoSys triggers Kubernetes jobs; Kubernetes runs them. Never let AutoSys own pod lifecycle.
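The polling wrapper described in this section can be sketched in Python as well as in shell. Everything here is an assumption made for illustration: wait_for_pod, kubectl_phase, and alarm_minutes are hypothetical names, the exit codes 0/1/2 are this sketch's own convention, and the phase lookup is passed in as a function so the loop can be exercised without a cluster.

```python
"""Minimal sketch: poll a pod's phase until it finishes, then exit with a clear code."""
import math
import subprocess
import time


def kubectl_phase(job_name: str, namespace: str) -> str:
    """Return the phase of the first pod created by the Kubernetes Job ('' if none yet)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-l", f"job-name={job_name}",
         "-o", "jsonpath={.items[0].status.phase}"],
        capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_pod(get_phase, poll_interval=5, timeout=3600,
                 sleep=time.sleep, clock=time.monotonic) -> int:
    """Poll get_phase() until Succeeded (0) or Failed (1); return 2 on timeout."""
    deadline = clock() + timeout
    while clock() < deadline:
        phase = get_phase()
        if phase == "Succeeded":
            return 0
        if phase == "Failed":
            return 1
        sleep(poll_interval)
    return 2  # state unknown: report failure and let a human look


def alarm_minutes(expected_seconds: int, buffer_percent: int = 30) -> int:
    """Expected runtime plus a buffer, rounded up to the minutes max_run_alarm expects."""
    return math.ceil(expected_seconds * (100 + buffer_percent) / 6000)


print(alarm_minutes(3600))
```

In a real job the call would be wait_for_pod(lambda: kubectl_phase("my-batch-job", "production")), with the return value passed to sys.exit so AutoSys sees it.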
Production incident · Post-mortem · Severity: high

The Silent Pipeline Failure: A Hung Job Blocks Overnight Batch

Symptom
Batch pipeline frozen at 3:17 AM. All downstream jobs idle, still waiting on their start conditions. The upstream job shows RUNNING status but no CPU or I/O activity on the agent machine.
Assumption
The job was a simple database procedure call. Since it ran successfully thousands of times, the team assumed it would never hang.
Root cause
No max_run_alarm configured on the job. The stored procedure entered an infinite loop due to a data anomaly (unexpected NULL in a column used in the loop condition). AutoSys never terminated it because the job process was still alive — just not progressing.
Fix
Add max_run_alarm: 60 (minutes) to the JIL definition, and add term_run_time so that AutoSys kills the job if it is still running past a hard limit. Set an alert on the alarm to page the on-call engineer. Also added a guard in the stored procedure to detect and exit on NULL values.
Key lesson
  • Always set max_run_alarm on every CMD job that calls external processes — even 'simple' database calls can hang.
  • A job in RUNNING status doesn't mean it's making progress. Monitor CPU and database activity separately.
  • max_run_alarm without term_run_time just warns: it raises the alarm but leaves the hung job running.
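Picking the numbers is the part teams skip. One way to derive them from history is sketched below in Python. The policy is an assumption, not an AutoSys rule: suggest_limits is a hypothetical helper that alarms at the longest observed run plus a buffer and kills at a multiple of the alarm, and the run times are invented sample data.

```python
"""Minimal sketch: derive max_run_alarm and term_run_time (minutes) from past runs."""
import math


def suggest_limits(history_minutes, buffer_percent=30, kill_multiple=2):
    """Return suggested limits in minutes based on the longest historical run."""
    if not history_minutes:
        raise ValueError("need at least one historical run time")
    longest = max(history_minutes)
    alarm = math.ceil(longest * (100 + buffer_percent) / 100)
    return {"max_run_alarm": alarm, "term_run_time": alarm * kill_multiple}


# Invented sample data: three past runs of 38, 41, and 45 minutes.
limits = suggest_limits([38, 41, 45])
print(limits)
```

The alarm tells you early; the terminate limit makes sure the pipeline is not still blocked when you finally read the page.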
Production debug guide
Symptom → Action patterns for the most common production issues (4 entries)
Symptom · 01
Job shows RUNNING but no activity on agent machine
Fix
Run autorep -J JOB_NAME to see the current status, then autorep -J JOB_NAME -d for the detailed event report of the last run. Find the job's process on the agent machine and verify that it's still alive. If the process exited but AutoSys didn't detect it, the agent may need a restart.
Symptom · 02
Job fails with status TERMINATED immediately after start
Fix
Run autorep -J JOB_NAME -d and read the job's stderr file. Common causes: missing environment variables, incorrect working directory, or insufficient permissions. Check the agent machine's syslog for segfaults or permission denials.
Symptom · 03
Box job stuck in RUNNING or ACTIVATED status for hours
Fix
A box stays RUNNING until its children complete. If the box has no children, or the children can never start (for example because they sit ON HOLD), it never finishes. Run autorep -J BOX_NAME to list the box with its children and their statuses. If children are missing, re-insert them.
Symptom · 04
Daily job runs on wrong day or not at all
Fix
Check date_conditions and days_of_week in the JIL (autorep -J JOB_NAME -q prints the definition). If the job uses a calendar, verify its dates with the autocal_asc utility. Is the job ON ICE or ON HOLD? autorep -J JOB_NAME shows OI or OH in the status column.
AutoSys Job Failure Quick Debug Cheat Sheet
Five commands that diagnose 90% of production AutoSys issues
Job status unknown or not running as expected
Immediate action
Check job definition and status
Commands
autorep -J JOB_NAME
autorep -J JOB_NAME -q
Fix now
If the job is ON ICE, release it with sendevent -E JOB_OFF_ICE -J JOB_NAME; it then runs at the next occurrence of its start conditions.
Job failed with exit code but no log
Immediate action
Get the detailed event report
Commands
autorep -J JOB_NAME -d
autorep -J JOB_NAME -q | grep -i std_
Fix now
Check the std_out_file and std_err_file paths from the job definition on the agent machine.
Job stuck in RUNNING
Immediate action
Kill the hung process
Commands
sendevent -E KILLJOB -J JOB_NAME
autorep -J JOB_NAME
Fix now
If KILLJOB doesn't work, SSH to the agent, find the job's process, and kill it by PID.
Dependent job never starts even after upstream succeeds
Immediate action
Verify the condition is met
Commands
autorep -J UPSTREAM_JOB
job_depends -c -J DOWNSTREAM_JOB
Fix now
If the condition uses s() but the upstream finished with a non-zero exit code, the condition will never trigger. Change it to d() (done: any completion, whatever the exit code) or add a retry on the upstream.
| Feature | AutoSys | Cron | Windows Task Scheduler |
| Job dependencies | Yes, full condition logic | No | Limited |
| Cross-server scheduling | Yes, centralised | No, per server | No, per machine |
| GUI / dashboard | Yes, WCC GUI | No | Yes, basic |
| Retry on failure | Yes, configurable | No, manual | Limited |
| Audit trail | Yes, full event log | No | Limited |
| Alert on overrun | Yes, max_run_alarm | No | No |
| Hold/suspend jobs | Yes, ON HOLD / ON ICE | No, must delete | Limited |

Key takeaways

1. AutoSys is an enterprise workload automation platform for scheduling, monitoring, and managing batch jobs across multiple servers.
2. It solves the cross-server dependency and centralised monitoring problem that cron simply can't handle at scale.
3. Jobs are defined using JIL (Job Information Language), a scripting language specific to AutoSys.
4. It's heavily used in banking, insurance, telecoms, and other industries that run large overnight batch workloads.
5. AutoSys is now owned by Broadcom and actively developed as of 2026.
6. Always set max_run_alarm. Always test with the service account. Never confuse ON HOLD and ON ICE.

Common mistakes to avoid


Treating AutoSys like cron

Symptom
Jobs fail or behave unexpectedly because you're using cron's mental model, for example ignoring the job lifecycle or forgetting that conditions are evaluated centrally by the scheduler, not on the agent.
Fix
Learn AutoSys's state machine and JIL syntax. Realise that AutoSys jobs persist in the database and can be scheduled without being in a file. You don't 'edit a crontab' — you update JIL definitions.

Forgetting that jobs run under a service account

Symptom
Permission denied errors on file reads, database connections, or script execution on first run. The job works when you test it manually (as yourself) but fails in AutoSys.
Fix
Always test your job command with the exact service account. Use sudo -u svcaccount /path/to/command in a test environment. Ensure the account has the required permissions on the agent machine and network.

Not setting max_run_alarm

Symptom
A job that hangs (e.g., infinite loop in a script) runs forever. Downstream jobs are blocked waiting for it. No alert fires until morning when someone notices the pipeline is frozen.
Fix
Add max_run_alarm to every CMD job. Set a reasonable value based on historical run times plus a buffer. Consider term_run_time as well for critical jobs, so a hung job is killed instead of only reported.

Confusing ON HOLD and ON ICE

Symptom
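A job taken off hold starts immediately because its conditions were already met, causing an unexpected execution. Or downstream jobs fire at 3am because the upstream they wait on was put ON ICE, which counts as satisfied for their conditions.
Fix
Memorise: ON HOLD = blocked until released, runs at once on release if its conditions are met, and keeps downstream jobs waiting. ON ICE = out of the schedule until released, waits for the next occurrence of its conditions, and lets downstream jobs proceed. To pause a job together with everything behind it, use ON HOLD. To skip a job while the rest of the chain continues, use ON ICE.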

Not using conditions correctly with status characters

Symptom
Condition success(jobA) fails to trigger because jobA exited with non-zero but you only care about completion. Or a condition is written with functions or operators JIL doesn't have, such as st(jobA).st(jobB), when s(jobA) & s(jobB) was meant.
Fix
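Learn the condition syntax: s = success, f = failure, d = done (any completion), t = terminated, n = notrunning, e = exitcode (compare against a specific exit code). Use d when you don't care about the exit code. Combine conditions with & (AND) and | (OR), and group them with parentheses.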
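A cheap guard against mistyped condition functions is a lint step that runs before the JIL is loaded. The Python sketch below is illustrative: unknown_condition_functions is a hypothetical helper, and the accepted names are the short and long forms I am confident about, so check the list against the reference for your release before relying on it.

```python
"""Minimal sketch: flag condition functions that are not recognised JIL names."""
import re

# Short and long forms; an assumption to verify against your release's reference.
VALID_FUNCTIONS = {
    "s", "success", "f", "failure", "d", "done", "t", "terminated",
    "n", "notrunning", "e", "exitcode", "v", "value",
}

_CALL = re.compile(r"([A-Za-z_]+)\s*\(")


def unknown_condition_functions(condition: str) -> list:
    """Return every function name in the condition that is not in VALID_FUNCTIONS."""
    return [name for name in _CALL.findall(condition) if name.lower() not in VALID_FUNCTIONS]


for text in ("s(jobA) & d(jobB)", "o(jobA)", "st(jobA).st(jobB)"):
    print(text, "->", unknown_condition_functions(text))
```

This only checks function names. It does not validate operators or job names, so treat an empty result as "no obvious typo", not as proof the condition is right.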
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Junior): What is AutoSys and what problems does it solve that cron cannot?
Q02 (Junior): Name the three types of jobs in AutoSys.
Q03 (Senior): What is JIL and how is it used to define jobs?
Q04 (Junior): In what industries is AutoSys most commonly used and why?
Q05 (Senior): What is the Event Server and what does it store?
Q06 (Senior): What is the difference between ON HOLD and ON ICE?

Answer to Q01
AutoSys is an enterprise workload automation platform. It solves cross-server dependency management, centralised monitoring, retry logic, audit trails, and complex scheduling that cron cannot handle at scale. Cron is per-server, has no dependency model, no central dashboard, and no built-in alerting.
Frequently Asked Questions

01. Is AutoSys still relevant in 2026?
02. Do I need to know Linux to use AutoSys?
03. What is the difference between CA AutoSys and Broadcom AutoSys?
04. Can AutoSys run Python scripts?
05. How do I debug a job that fails with exit code 1?