Definitive Guide · TheCodeForge

The Complete AutoSys Guide for Engineers

Verified in production environments · Enterprise-grade examples

Everything you need to master CA AutoSys Workload Automation — from JIL syntax to production incident recovery. Written for engineers who actually run these systems at scale.

20 Core Sections
40+ Commands & Examples
15 Debug Patterns
20+ Interview Qs

What AutoSys Actually Is

AutoSys (officially CA Workload Automation AE, owned by Broadcom) is an enterprise job scheduler built for complex workloads across distributed systems — batch jobs, file watchers, multi-step workflows, and event-driven pipelines. It's the backbone of nightly batch processes in investment banks, insurance companies, telcos, and healthcare — the infrastructure behind payroll runs, bank settlements, ETL pipelines, and regulatory reporting at 2am. If you've done enterprise application support and something broke in the night, there's a fair chance AutoSys was somewhere in the chain.

In most enterprises it's the central platform for workload automation: the thing that coordinates hundreds of jobs across dozens of servers and makes sure they run in the right order, on the right machine, at the right time. You'll also encounter it referred to as Workload Automation AE or, in older shops, simply as CA AutoSys. When it works, nobody notices. When it breaks, everything stops.

Plain English
"Think of AutoSys as a highly opinionated cron — but one where jobs know about each other, dependencies are enforced across servers, and when something breaks at 3am, the on-call engineer gets paged with the exact job name, exit code, and failure reason."

What separates AutoSys from cron is that jobs are event-driven, not just time-driven. A job doesn't fire because the clock hit 22:00 — it runs when a set of conditions are all true: another job succeeded, a file landed in a directory, a global variable was set by an upstream process, or an external system sent a trigger. That event management capability is what makes AutoSys the coordination layer for complex workflows where ordering, cross-server dependencies, and integration between systems all matter.

AutoSys vs Cron vs Control-M vs Automic

Feature | Cron | AutoSys | Control-M | Automic | Apache Airflow
Job dependencies | None | Full condition logic | Full condition logic | Full condition logic | DAG-based dependencies
Cross-server jobs | Per-server only | Remote agents | Remote agents | Remote agents | Celery / Kubernetes executors
File watchers | No | Native FW job type | Native | Native | Sensors (FileSensor)
Alerting | Manual setup | Built-in alarms | Built-in | Built-in | Callbacks + PagerDuty hooks
Restart/recovery | Manual | sendevent commands | GUI + commands | GUI + commands | Task retry + backfill
Audit trail | Logs only | Full event history in DB | Full event history | Full event history | Metadata DB + UI logs
Licensing | Free / OS | Commercial (Broadcom) | Commercial (BMC) | Commercial (Broadcom) | Open-source (Apache 2.0)
Pipeline definition | crontab | JIL | GUI / XML | GUI / scripts | Python DAGs
Typical use | Simple scheduled tasks | Enterprise batch pipelines | Enterprise batch pipelines | Enterprise automation / SAP workloads | Data engineering / ML pipelines

AutoSys vs Apache Airflow — The Real Decision

The comparison table above shows the surface differences. The real question enterprises face in 2025–2026 is whether to keep AutoSys for existing batch pipelines or migrate to Airflow for new workloads — or run both. Here's the honest answer based on what's actually happening in production environments.

Dimension | AutoSys | Apache Airflow | When it matters
Job definition | JIL: declarative text, version-controlled, environment-portable | Python DAGs: full programming language, more expressive but harder to audit | Compliance-heavy shops prefer JIL's audit trail. Data engineering teams prefer Python DAGs.
Cross-server execution | Native: agents on any OS, no extra config | Requires Celery/K8s executor setup and worker infrastructure | Legacy mixed-OS environments (Linux + Windows + AIX): AutoSys wins cleanly
File-based triggers | Native FW job type, no code required | FileSensor: requires Python, polling config, and operator knowledge | File-arrival pipelines in banking/insurance: AutoSys is simpler to operate
Failure recovery | sendevent FORCE_STARTJOB with -D date flag; ops team can recover without dev involvement | Clear/re-run from UI or CLI; requires understanding DAG state | 24/7 ops teams without Python skills: AutoSys recoveries are faster to execute
ML/data pipelines | Possible via CMD jobs calling Python, but awkward | Native: PythonOperator, SparkSubmitOperator, dbt integration | New ML workloads should go on Airflow. Retrofitting AutoSys is painful.
Scalability | EAS is a single-process bottleneck; high job counts require careful tuning | Horizontally scalable with Celery/K8s workers | 10,000+ concurrent tasks: Airflow scales better
Vendor dependency | Broadcom licensing; price increases post-VMware acquisition have accelerated migration interest | Apache 2.0, no licensing cost | Cost-reduction initiatives: Airflow migration is increasingly common
Operational complexity | Low for ops teams: JIL is approachable, CLI tools are consistent | Higher: requires Python knowledge across the ops team | Shops without dedicated platform engineering: AutoSys is easier to hand off
💡
The realistic 2026 answer: Most enterprises running AutoSys are not replacing it — they're running Airflow alongside it for new data engineering workloads while keeping AutoSys for existing batch pipelines. The migration cost, operational risk, and retraining effort for a 500-job AutoSys environment is rarely justified by the benefits unless you're also rebuilding the underlying jobs. If you're starting a new pipeline today with no AutoSys dependency, use Airflow. If you're maintaining or extending existing AutoSys workloads, learn JIL properly — it'll serve you better than a half-finished migration.

AutoSys Workload Automation — Architecture Deep Dive

CA Workload Automation AE has three main components: the Event Server (EAS), Remote Agents, and the AutoSys database. Understanding how they interact is essential for diagnosing failures and designing reliable pipelines. Together they give you end-to-end visibility of workloads running across your entire infrastructure — from on-premises Linux servers to cloud agents.

[Figure: AutoSys component architecture. Three input sources (sendevent CLI, WAAE Web UI, external application triggers) send events and commands to the Event Server (EAS), which schedules, evaluates conditions, and dispatches jobs over TCP to remote agents (Linux, Windows, AIX/HP-UX). Job results and status are written to the AutoSys database (Oracle or MSSQL). The Event Server reads and writes all job state to the database; agents are stateless and only execute and report back.]

The Event Server

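The Event Server (EAS) is the scheduler's brain: it polls the database for new events, evaluates which jobs are ready to run based on their conditions, and dispatches them to agents. (Broadcom's own documentation calls this process the scheduler and uses "event server" for the database; this guide keeps the EAS shorthand.) If the EAS process dies, nothing runs. Jobs don't error out; they just stop. This is why the first thing you check during any "why isn't this running?" investigation is whether the scheduler is actually alive. On Linux the scheduler process is named event_demon, so use ps -ef | grep event_demon, or run chk_auto_up to check the scheduler and database together.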

Remote Agents

Agents are stateless daemons on target machines. They wait for dispatch instructions from the EAS, execute the job command under the configured owner account, and write the exit code back to the database. That's it. They don't have opinions about what they're running — if the command path is wrong or the owner doesn't exist, the job silently hangs in STARTING. The agent has no way to tell the EAS "I tried but couldn't start this."

💡
On telemetry: AutoSys 12.0+ includes a telemetry capability that sends product usage data to Broadcom. It's enabled by default in newer installations. This is worth knowing if your organisation has strict data egress policies — check with your AutoSys admin team whether it's been disabled in your environment, and whether it needs to be in your security baseline.
Production Incident — Agent Connection Failure

Symptom: Jobs targeting a specific machine are stuck in STARTING state for 10+ minutes, then fail with "connection refused."

Root cause: The Remote Agent daemon on the target machine crashed after an OS patch restart. The Event Server couldn't reach it.

Fix: SSH to the target machine, check agent status with ps -ef | grep cybAgent, restart with service cybagent restart, then force-start the affected jobs.

Prevention: Add agent health monitoring to your infrastructure alerting. A dead agent is silent — it doesn't alert AutoSys directly.
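
The prevention step above could be sketched as a small check run from cron or your monitoring agent. This is a minimal sketch, not a definitive implementation: it assumes autoping exits non-zero when an agent cannot be reached (verify this on your release), and the check_agents helper and the machine names in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report AutoSys agents that do not answer autoping.
# Assumption: `autoping -m <machine>` exits non-zero when the agent is unreachable.

# check_agents MACHINE...
# Prints one "AGENT DOWN: <machine>" line per unreachable agent.
# Returns 0 if every agent answered, 1 otherwise.
check_agents() {
  local machine failed=0
  for machine in "$@"; do
    if ! autoping -m "$machine" >/dev/null 2>&1; then
      echo "AGENT DOWN: $machine"
      failed=1
    fi
  done
  return "$failed"
}

# Example (hypothetical machine names):
#   check_agents db-extract-01 etl-proc-01 dw-loader-01 || echo "page on-call"
```

Because the function only prints and returns a status, it can feed whatever alerting you already have instead of introducing a new notification path.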

JIL Syntax — The Complete Reference

JIL (Job Information Language) is the language used to define every object in AutoSys: jobs, boxes, file watchers, global variables, machine registrations. It looks like a config file format, not a programming language, but don't underestimate it. A single missing attribute or a wrong machine alias can cause a job to silently never run. Learn to read JIL fluently; learn to write it carefully.

BOX Job — Pipeline Container

payment_eod.jil
JIL
/* BOX — container for the entire EOD pipeline */
insert_job: BOX_PAYMENT_EOD   job_type: BOX
owner:            svc_autosys
permission:       gx,ge
date_conditions:  1
days_of_week:     mo,tu,we,th,fr
start_times:      "22:00"
timezone:         GMT
description:      "End-of-day payment settlement pipeline"
alarm_if_fail:    1
alarm_if_terminated: 1

CMD Job — Shell Command Execution

extract_transform_load.jil
JIL
/* Step 1: Extract — no condition, runs when box starts */
insert_job: JOB_EXTRACT_TXN   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/etl/extract_transactions.sh
machine:    db-extract-01
owner:      svc_autosys
std_out_file: /logs/autosys/extract_txn.out
std_err_file: /logs/autosys/extract_txn.err
max_run_alarm: 60

/* Step 2: Transform — depends on extract success */
insert_job: JOB_TRANSFORM_TXN   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/etl/transform.sh
machine:    etl-proc-01
owner:      svc_autosys
condition:  success(JOB_EXTRACT_TXN)
max_run_alarm: 45

/* Step 3: Load — depends on transform, alerts on failure */
insert_job: JOB_LOAD_TXN   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/etl/load_to_dw.sh
machine:    dw-loader-01
owner:      svc_autosys
condition:  success(JOB_TRANSFORM_TXN)
alarm_if_fail: 1
max_run_alarm: 30
n_retrys:   2
retry_interval: 5
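Always set max_run_alarm on load jobs. Without it, a hung database connection keeps a CMD job in RUNNING state indefinitely: the box never completes and no alert fires. Note that max_run_alarm only raises an alarm; add term_run_time if the job should also be killed when it overruns. 30 minutes is a reasonable default for most ETL loads.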

File Watcher Job

file_watcher.jil
JIL
/* File watcher — triggers when file arrives */
insert_job: FW_PAYMENT_FILE   job_type: FW
box_name:        BOX_PAYMENT_EOD
watch_file:      /data/incoming/payment_*.csv
watch_interval:  60
owner:           svc_autosys
machine:         file-drop-01
watch_file_min_size: 1

/* Process job runs only after file arrives */
insert_job: JOB_PROCESS_PAYMENT   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/payment/process.sh
machine:    proc-01
condition:  success(FW_PAYMENT_FILE)

AutoSys Workload Automation — Job Status Codes Explained

Every AutoSys job cycles through a series of status codes. Knowing what each one means — and what caused it — is the core skill for on-call AutoSys support.

Status | Meaning | Common Cause
INACTIVE | Job exists but has not yet run in this cycle | Normal initial state
ACTIVATED | Box started, job is waiting for its conditions | Normal, waiting for dependencies
STARTING | Event Server dispatching to remote agent | Normal, should be brief (<30s)
RUNNING | Executing on remote agent | Normal during execution
SUCCESS | Completed with exit code 0 | Normal completion
FAILURE | Completed with non-zero exit code | Script error, bad data, permissions
TERMINATED | Killed via sendevent KILLJOB or by term_run_time | Manual intervention or term_run_time exceeded (max_run_alarm only raises an alarm)
ON_HOLD | Held by operator, will not run | Manual hold, maintenance window
ON_ICE | Frozen, invisible to scheduler | Manual skip, downstream proceeds
RESTART | Waiting to be restarted after failure | n_retrys configured, retry pending
💡
STARTING stuck for >5 minutes almost always means the remote agent is down or unreachable. Check the agent first before investigating the job itself.
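
When scripting around these statuses, note that the autorep summary report prints them as two-letter codes in its ST column rather than the full names in the table above. A small lookup keeps scripts readable. This is a minimal sketch: the status_name function is hypothetical, and the mapping covers only the statuses listed in the table.

```shell
#!/usr/bin/env bash
# Sketch: translate the two-letter status code from autorep's ST column
# into the full status name used in this guide.

# status_name CODE
status_name() {
  case "$1" in
    IN) echo INACTIVE ;;
    AC) echo ACTIVATED ;;
    ST) echo STARTING ;;
    RU) echo RUNNING ;;
    SU) echo SUCCESS ;;
    FA) echo FAILURE ;;
    TE) echo TERMINATED ;;
    OH) echo ON_HOLD ;;
    OI) echo ON_ICE ;;
    RE) echo RESTART ;;
    *)  echo "UNKNOWN($1)"; return 1 ;;   # code not covered by the table
  esac
}

# Example:
#   status_name FA
```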

sendevent — The On-Call Toolkit

When something breaks in production, sendevent is how you fix it. It's AutoSys's event management interface — every manual intervention, every status change, every pipeline recovery goes through this command. Learn these cold. You will not have time to look them up at 3am with a box stuck, an SLA breaching, and stakeholders pinging you.

Command | What It Does | Details
sendevent -E FORCE_STARTJOB -J <job> | Force start immediately | Bypasses all conditions and time windows. Use when a dependency already completed but status wasn't captured.
sendevent -E KILLJOB -J <job> | Kill a running or hung job | Sends SIGTERM to the process on the remote agent. Job status becomes TERMINATED.
sendevent -E CHANGE_STATUS -J <job> -s SUCCESS | Manually mark as SUCCESS | Critical recovery tool. Use when work was completed manually and you need the pipeline to proceed.
sendevent -E JOB_ON_HOLD -J <job> | Put job on hold | Job won't run until released. Downstream jobs wait. Persists across box restarts.
sendevent -E JOB_ON_ICE -J <job> | Freeze job completely | Job is invisible to the scheduler. Downstream jobs behave as if it already succeeded.
sendevent -E JOB_OFF_HOLD -J <job> | Release job from hold | Job re-enters scheduling. If conditions are met, it will start on the next evaluation cycle.
sendevent -E JOB_OFF_ICE -J <job> | Unfreeze job | Job re-enters scheduler consideration. Does not automatically trigger a run.
sendevent -E STARTJOB -J <box> | Start a box manually | Starts the box outside its normal schedule. All child jobs will run per their conditions within the box.
Production Incident — Wrong Recovery Tool

Scenario: JOB_LOAD_TXN failed at 23:45 due to a tablespace issue. The DBA fixed the tablespace and manually inserted the data directly. The EOD box is stuck waiting for the load job to succeed.

Wrong fix: sendevent -E FORCE_STARTJOB -J JOB_LOAD_TXN — this re-runs the load script, which will attempt to insert the data again and create duplicates.

Correct fix: sendevent -E CHANGE_STATUS -J JOB_LOAD_TXN -s SUCCESS — marks the job as succeeded without running it, allowing downstream jobs to proceed. No duplicate data.
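
The decision in this incident comes down to one question: has the work already been done by hand? A runbook helper could encode that so the on-call engineer answers the question before anything is sent. This is a sketch only; recovery_command is a hypothetical name, and it deliberately prints the sendevent command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: choose the recovery event based on whether the job's work
# was already completed manually. Prints the command; executes nothing.

# recovery_command JOB yes|no
#   yes = work was completed manually  -> mark SUCCESS, do not re-run
#   no  = work still has to happen     -> force-start the job
recovery_command() {
  local job=$1 work_done=$2
  case "$work_done" in
    yes) echo "sendevent -E CHANGE_STATUS -J $job -s SUCCESS" ;;
    no)  echo "sendevent -E FORCE_STARTJOB -J $job" ;;
    *)   echo "usage: recovery_command JOB yes|no" >&2; return 2 ;;
  esac
}

# Example for the incident above (the DBA already loaded the data):
#   recovery_command JOB_LOAD_TXN yes
```

Printing instead of executing keeps a human in the loop, which is the point: the wrong event here creates duplicate data.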

autorep — Monitoring & Reporting

autorep is your read-only window into the AutoSys database — job status, run history, definitions, machines, calendars, global variables. It's the primary monitoring tool for workload automation on the command line, giving you full visibility into what's running, what failed, and what's scheduled next. You can't break anything with it. Run it freely, run it often.

autorep_commands.sh
BASH
# Show current status of a specific job
autorep -J JOB_LOAD_TXN -s

# Show all jobs in a box with their current status
autorep -J BOX_PAYMENT_EOD -s

# Show the detail report for a job (status plus the events of its latest run)
autorep -J JOB_LOAD_TXN -d

# Show an earlier run of a job (-r -1 is the run before the latest one)
autorep -J JOB_LOAD_TXN -r -1

# Show all jobs currently in FAILURE (the ST column prints two-letter codes: FA)
autorep -J % -s | grep -w FA

# Show all jobs on a specific machine
autorep -M db-extract-01 -s

# Show all currently RUNNING jobs (two-letter code: RU)
autorep -J % -s | grep -w RU

# Show jobs that ran between specific times
autorep -J BOX_PAYMENT_EOD -s -t -S 2200 -E 2359

# Export job definition back to JIL format
autorep -J BOX_PAYMENT_EOD -q
🔍
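autorep -J % -s | grep -w FA is your first command every morning during on-call. The % wildcard matches all jobs, and the summary report prints status as a two-letter code (FA for FAILURE, RU for RUNNING), so grep for the code rather than the full status name. Add | wc -l to get a count.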
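
A plain grep can also match a job whose name happens to contain the status text. A slightly stricter count is sketched below. It assumes the summary layout in which the two-letter status code is a standalone column immediately followed by a run/try field such as 101/1; compare this against a sample of your own autorep output before relying on it. The count_status name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: count jobs with a given two-letter status in `autorep -s` output.
# Assumption: the status code is its own whitespace-separated column and is
# immediately followed by a run/try field such as "101/1".

# count_status CODE   (reads the autorep summary report on stdin)
count_status() {
  awk -v code="$1" '
    {
      # Start at field 2 so the job name itself is never matched.
      for (i = 2; i < NF; i++) {
        if ($i == code && $(i + 1) ~ /^[0-9]+\/[0-9]+$/) { n++; break }
      }
    }
    END { print n + 0 }
  '
}

# Example:
#   autorep -J % -s | count_status FA
```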

AutoSys Web UI — The Graphical Interface

Most engineers learn AutoSys through the command line, which is fine — the CLI is faster for day-to-day support. But the AutoSys Web UI (formerly Workload Control Center / CA WCC) is worth knowing because it's what business stakeholders, junior operators, and managers use. It provides real-time visibility across all your workloads — job status, alerts, dependencies, overdue jobs — from a single browser interface. In some shops it's also how job definitions get submitted via Quick Edit. If you're supporting a team, you need to know both.

The AutoSys Web UI connects to one or more Event Server instances and gives you a browser-based view of the same data autorep shows on the command line. The underlying database is identical — Web UI doesn't have its own state.

Key Web UI Panels

Panel | What It Does | CLI Equivalent
Monitoring | Live view of job status across instances. Spot overdue jobs, set up alert subscriptions, filter by box, machine, or status. | autorep -J % -s
Quick View | Single-job deep dive: definition, conditions, run history, logs, and flow diagram. Send events directly from here without touching the CLI. | autorep -J job -f -d
Quick Edit | Edit a job definition directly from the AutoSys Web UI: change attributes, conditions, or scheduling without writing JIL manually. | jil (update_job)
Application Editor | Visual dependency graph showing which jobs depend on which, conditions, and successor chains. Invaluable for impact analysis before changes. | job_depends -J box
Enterprise Command Line | Browser-based terminal. Run jil, autorep, sendevent on the connected server without SSH access. | Direct CLI
Forecast | Shows upcoming scheduled runs for a job or box. Use it to spot overdue jobs before they become incidents and validate calendar logic before go-live. | forecast -J box -t 30
Resources | Visual resource utilisation: how many units are consumed vs available right now. | autorep -r RESOURCE -s
Credentials | Manage per-server authentication credentials. Add/update/validate the owner account passwords that agents use to execute jobs. | N/A (admin only)
💡
Application Editor is underused and extremely useful. When someone asks "if I force-start this job, what downstream jobs will run?" — the Application Editor answers that in seconds. It renders the full condition dependency graph visually, including cross-box dependencies that are invisible in autorep output.
Enterprise Command Line in Web UI is your friend in locked-down environments. In banks and insurance companies, direct SSH to the AutoSys server is often restricted to a small group. If you have Web UI access, Enterprise Command Line lets you run autorep, sendevent, and jil through the browser — same commands, same results, no terminal required.
Heads Up — Web UI ≠ CLI for Job Definitions

The fields in Web UI map directly to JIL attributes — there's no separate "Web UI format." When you save a job definition in Web UI, it writes JIL to the database exactly the same as jil < script.jil would. You can define a job in Web UI and export it as JIL with autorep -J JOB_NAME -q.

One gotcha: Web UI may not expose every JIL attribute through its forms — particularly custom or advanced attributes. For anything beyond standard CMD/BOX/FW job types, use JIL directly. Don't assume every attribute is accessible through the GUI.

Production Debug Guide

These are the failure patterns that repeat across every enterprise AutoSys environment. Batch processes fail for predictable reasons — and once you've seen each of them once, you diagnose them in minutes instead of hours. Not theory — this is what actually breaks at 2am and costs operational efficiency when it does.

Pattern 1: Box stuck in RUNNING, all child jobs SUCCESS

Nine times out of ten, the box has a condition attribute of its own that references a job outside the box that hasn't completed yet. Run autorep -J BOX_NAME -q to print the box definition and look at the box's own condition line, not just its children. If you're in the AutoSys Web UI, Quick View on the box shows the same information with a visual flow diagram that makes cross-box dependencies immediately obvious. The other case is a child job added mid-cycle that never ran. Both look identical from the monitoring view.

Pattern 2: Job stuck in STARTING for more than 5 minutes

Start with autoping -m MACHINE_NAME. If it times out, the agent is dead — restart it and move on. If autoping succeeds, the agent is alive but still not executing. At that point, check whether the owner account exists on the target machine (ssh machine id svc_autosys). A missing owner is the most common cause of a persistent STARTING hang that survives an agent restart. If the owner exists, check that it has execute permission on the command path.
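
The checks in this pattern run in a fixed order, so they lend themselves to a short triage script that stops at the first failure. The sketch below is illustrative only: triage_starting is a hypothetical helper, and it assumes autoping exits non-zero for an unreachable agent, that you can ssh to the target machine, and that passwordless sudo to the owner account is permitted there.

```shell
#!/usr/bin/env bash
# Sketch: walk the STARTING-hang checks in order and stop at the first failure.
# Assumptions: autoping exits non-zero when the agent is unreachable, the caller
# can ssh to the target, and passwordless sudo to the owner account is allowed.

# triage_starting MACHINE OWNER COMMAND_PATH
triage_starting() {
  local machine=$1 owner=$2 cmd_path=$3

  # 1. Is the agent answering at all?
  if ! autoping -m "$machine" >/dev/null 2>&1; then
    echo "agent unreachable on $machine: restart the agent"
    return 1
  fi
  # 2. Does the owner account exist on the target?
  if ! ssh "$machine" id "$owner" >/dev/null 2>&1; then
    echo "owner $owner does not exist on $machine"
    return 2
  fi
  # 3. Can the owner execute the command path?
  if ! ssh "$machine" sudo -n -u "$owner" test -x "$cmd_path" >/dev/null 2>&1; then
    echo "$owner cannot execute $cmd_path on $machine"
    return 3
  fi
  echo "agent, owner and command path look fine on $machine"
}

# Example:
#   triage_starting db-extract-01 svc_autosys /opt/etl/extract_transactions.sh
```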

Pattern 3: Jobs in ACTIVATED but conditions look satisfied

Check global variables first with autorep -G %. If a condition includes value(GVAR_NAME) != "" and the variable is empty or stale, the job will sit in ACTIVATED indefinitely with no error. After that, check the Event Server itself: ps -ef | grep event_demon (the scheduler process does not have EAS in its name). A live EAS process that can't reach the database queues everything internally without dispatching. Review the scheduler log with autosyslog -e and check /opt/CA/WorkloadAutomationAE/SystemAgent/logs/ for connection errors.

Production Incident — Calendar Mismatch

Symptom: End-of-month jobs didn't run. No failures, just no execution. Jobs show INACTIVE.

Root cause: The box had run_calendar: BUSINESS_DAYS and the last day of the month fell on a Saturday. The calendar had no entry for that date, so the box never activated.

Fix: Updated the calendar definition to include the last business day of each month using extended_calendar attributes. Manually force-started the box for the missed date using sendevent -E FORCE_STARTJOB -J BOX_EOM_REPORTS -D 20260330.

Lesson: Always run autorep -c CALENDAR_NAME -t against any new calendar and manually count the expected dates before it goes live. Month-end calendars are wrong more often than you'd expect.
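
Counting expected dates by hand is error-prone, so it can help to generate the reference list and compare it with what the calendar reports. The sketch below computes the last Monday-to-Friday day of a month. It assumes GNU date (for the -d option) and ignores public holidays, which a real business calendar would also need; last_business_day is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: last business day (Mon-Fri) of a month, printed as YYYYMMDD.
# Assumptions: GNU date (for -d); public holidays are NOT considered.

# last_business_day YEAR MONTH
last_business_day() {
  local year=$1 month day
  month=$(printf '%02d' "$((10#$2))")   # force base 10 so "08" and "09" work

  # First of the month, plus one month, minus one day = last calendar day.
  day=$(TZ=UTC date -d "${year}-${month}-01 +1 month -1 day" +%Y-%m-%d)

  # Step back while the day is Saturday (6) or Sunday (7).
  while [ "$(TZ=UTC date -d "$day" +%u)" -ge 6 ]; do
    day=$(TZ=UTC date -d "$day -1 day" +%Y-%m-%d)
  done

  TZ=UTC date -d "$day" +%Y%m%d
}

# Example: list the expected month-end run dates for a year.
#   for m in $(seq 1 12); do last_business_day 2026 "$m"; done
```

For example, last_business_day 2026 5 prints 20260529, because 31 May 2026 falls on a Sunday. Any month where the calendar disagrees with this list is worth a second look before go-live.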

Global Variables

Global variables are key-value pairs stored in the AutoSys database, readable and writable by any job on any machine at runtime. They're how you pass data between jobs in a pipeline — the run date, an input file path, a record count — without hardcoding values or relying on temp files on a shared filesystem.

Plain English
"Global variables are AutoSys's shared whiteboard. Any job can write a value on it, and any downstream job can read it. The extract job writes today's run date. The load job reads it. No hardcoding, no file passing."

Defining and Setting Global Variables

global_variables.jil
JIL
/* Define a global variable in JIL */
insert_global: GVAR_RUN_DATE
value: "20260417"

insert_global: GVAR_PAYMENT_FILE
value: ""

/* Reference a global variable in a job command */
insert_job: JOB_EXTRACT_TXN   job_type: CMD
command: /opt/etl/extract.sh $$GVAR_RUN_DATE
machine: db-extract-01

/* Use global variable in a condition */
insert_job: JOB_LOAD_TXN   job_type: CMD
command: /opt/etl/load.sh $$GVAR_PAYMENT_FILE
condition: success(JOB_EXTRACT_TXN) AND value(GVAR_PAYMENT_FILE) != ""

Setting Global Variables at Runtime

set_global_at_runtime.sh
BASH
# Set a global variable from within a running job script (syntax is -G "NAME=value")
sendevent -E SET_GLOBAL -G "GVAR_PAYMENT_FILE=/data/incoming/payment_20260417.csv"

# Set from command line (operator intervention)
sendevent -E SET_GLOBAL -G "GVAR_RUN_DATE=20260417"

# Read current value
autorep -G GVAR_RUN_DATE

# List all global variables
autorep -G %
Global variables in conditions use the value(GVAR_NAME) syntax. This lets you gate job execution on data state, not just job completion state — a powerful pattern for file-driven pipelines where you need to verify a file path was actually set before attempting to process it.
Production Gotcha — Global Variable Persistence

Problem: Global variables persist across box runs. If last night's run set GVAR_PAYMENT_FILE to a stale path and tonight's file watcher failed silently, the load job may process yesterday's file again.

Fix: Always reset critical global variables to empty at the start of each box run using a dedicated reset job as the first step, before any file watchers or extract jobs run.
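
A reset job covers the normal case. As a second line of defence, the load script itself can refuse a path that does not belong to tonight's run. The sketch below assumes the payment_YYYYMMDD.csv naming used in this guide's examples; payment_file_is_current is a hypothetical helper, and how the script receives the path and run date is left to your job definition.

```shell
#!/usr/bin/env bash
# Sketch: guard against a stale file path left in a global variable.
# Assumption: files are named payment_<YYYYMMDD>.csv, as in this guide's examples.

# payment_file_is_current FILE_PATH RUN_DATE
# Succeeds only if FILE_PATH is non-empty and its file name embeds RUN_DATE.
payment_file_is_current() {
  local file_path=$1 run_date=$2
  [ -n "$file_path" ] &&
    [ "$(basename -- "$file_path")" = "payment_${run_date}.csv" ]
}

# Example inside a load script (arguments passed from the job's command line):
#   payment_file_is_current "$1" "$2" || { echo "stale or empty file path: $1" >&2; exit 1; }
```

Failing fast here turns a silent reprocessing of yesterday's file into an ordinary FAILURE that alarms like any other.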

Security Model — RBAC and EEM

AutoSys Workload Automation security gets ignored until something goes wrong — a developer accidentally force-starts a production settlement run, or an audit finds stale accounts with Admin access that nobody revoked. The security model has two layers that are worth understanding separately: JIL permissions (per-job, defined in the job definition) and EEM (centralized role-based access, mapped to your LDAP/AD groups).

JIL Permission Attributes

Every job definition includes a permission attribute that controls access at the job level. Permissions are set as a comma-separated list of access codes.

job_permissions.jil
JIL
insert_job: JOB_PAYMENT_EOD   job_type: CMD
owner:       svc_autosys
permission:  gx,ge,wx,we,mx,me

/* Permission codes:
   g = group, w = world, m = all machines (any authorized user, whatever host they are on)
   x = execute (run/force-start)
   e = edit (modify JIL definition)

   The owner always has edit and execute permission; there is no read code.

   gx,ge = group can execute and edit
   wx,we = world can execute and edit
   mx,me = execute and edit allowed from any machine, not only the owner's machine

   Production best practice: gx,ge only
   Never wx or we in production: too permissive */

EEM Role-Based Access Control

EEM (CA Embedded Entitlements Manager) is the modern RBAC layer. It allows administrators to define roles (operator, developer, admin, read-only) and map them to LDAP/AD groups. This means access is managed centrally via your directory service rather than per-job JIL attributes.

Role | Typical Permissions | Use Case
Read-Only | View job status and history only | Business analysts, audit
Operator | Force-start, hold, ice, change status | On-call support engineers
Developer | Insert/update job definitions in dev/test | Application developers
Admin | Full access including delete and security changes | AutoSys team only
💡
Principle of least privilege applies here. On-call engineers need Operator access, not Admin. Developers need Developer access in non-production only. Audit every Admin account quarterly — in most enterprises, there are always stale accounts with Admin access that were never revoked.

High Availability — Dual Event Server

Running a single Event Server is a calculated risk that most teams accept in dev and test. In production, it's a problem. When the EAS goes down, scheduling stops completely — no failover, no queuing to another server, nothing. You lose all visibility into running workloads and no new jobs dispatch until it recovers. AutoSys addresses this with a Primary/Shadow configuration: two Event Servers sharing the same database, with the Shadow ready to take over if the Primary disappears.

How Primary/Shadow Works

The Primary is the active scheduler. The Shadow runs in parallel, reading the same database, staying in sync. It watches the Primary's heartbeat — a regular signal written to the database. When the heartbeat stops, the Shadow waits a configurable timeout, then promotes itself to Primary and resumes dispatching. In practice, failover takes 30–60 seconds. Jobs already running on agents are unaffected — they complete independently and write results back to the database when done.

Dual Event Server — HA Architecture
Primary EAS (Active)
Shadow EAS (Standby)
↓ both read/write ↓
AutoSys Database (shared)
↓ Primary dispatches to
Remote Agents
On Primary failure → Shadow detects missing heartbeat → promotes to Primary → resumes scheduling within ~60s

Key Configuration Points

  • Both servers must have identical AutoSys software versions and patch levels
  • The database must be on shared/clustered storage accessible to both servers
  • Network latency between Primary and Shadow should be under 10ms — same datacenter preferred
  • Remote agents connect to a virtual hostname or load balancer VIP, not the physical server name — this survives failover transparently
  • Monitor heartbeat with autoping -m EAS_MACHINE_NAME — include this in your monitoring stack
Production Incident — Split Brain

Symptom: Jobs ran twice. Duplicate records in the data warehouse.

Root cause: Network partition between Primary and Shadow. Shadow couldn't see Primary's heartbeat, promoted itself. Network recovered — now both servers thought they were Primary and dispatched the same jobs.

Fix: Immediately shut down one EAS instance. Identify duplicated records via run history timestamps. Implement a fencing mechanism (STONITH or database-based lock) to prevent dual-active scenarios.

Prevention: Use a dedicated heartbeat network interface separate from the data network. Configure the Shadow with a longer promotion timeout to survive brief network blips.

Modern Integrations — REST API, Cloud & Containers

AutoSys isn't just shell scripts on bare-metal Linux anymore. As a workload automation platform, it needs to integrate with the modern enterprise stack — REST APIs, containers, cloud infrastructure. Modern pipelines call REST APIs mid-run, spin up containers for compute-heavy steps, and dispatch jobs to agents running in cloud VPCs. The integration tooling to support this has been in AutoSys r12.x for a while — it's just less well-documented than the core JIL features.

REST API / Web Services Jobs

The Web Services job type (job_type: WS) allows AutoSys to call REST or SOAP endpoints directly as a job — no wrapper script needed. The job succeeds or fails based on the HTTP response code.

rest_api_job.jil
JIL
/* Web Services job — calls a REST endpoint */
insert_job: JOB_TRIGGER_RISK_ENGINE   job_type: WS
box_name:       BOX_PAYMENT_EOD
web_svc_url:    https://risk-engine.internal/api/v2/run-eod
web_svc_method: POST
web_svc_body:   {"run_date":"$$GVAR_RUN_DATE","mode":"full"}
web_svc_success_codes: 200,201,202
condition:      success(JOB_LOAD_TXN)
max_run_alarm:  15

Container Jobs

AutoSys supports running Docker containers as jobs via the DOCKER job type or via CMD jobs that invoke docker run or kubectl. The container job type manages the container lifecycle — pull, run, capture exit code, clean up.

container_job.jil
JIL
/* CMD job wrapping a Docker container run */
insert_job: JOB_RISK_CALC_CONTAINER   job_type: CMD
command: docker run --rm \
  -e RUN_DATE=$$GVAR_RUN_DATE \
  -v /data/risk:/data \
  registry.internal/risk-calculator:2.1.4 \
  --mode full-eod
machine:       docker-host-01
owner:         svc_autosys
max_run_alarm: 45

/* Kubernetes job via kubectl */
insert_job: JOB_RISK_CALC_K8S   job_type: CMD
command: kubectl create job risk-calc-$$GVAR_RUN_DATE \
  --from=cronjob/risk-calculator \
  --namespace=batch-jobs
machine:       k8s-bastion-01
max_run_alarm: 60

Cloud Agents — AWS & Azure Patterns

AutoSys cloud agents run on AWS EC2, Azure VMs, or GCP instances and register back to the on-premises Event Server. From AutoSys's perspective they're just another machine attribute — the job definition is identical whether it runs on-prem or in cloud. This makes AutoSys a genuine hybrid workload automation platform, enabling seamless integration between on-premises batch processes and cloud-based transformation, ML workloads, and modern data pipelines.

The patterns below cover the two most common cloud integration use cases: triggering AWS Glue / Lambda from an AutoSys CMD job via the AWS CLI, and dispatching Azure Batch tasks via the az CLI. Both use an agent running in the target cloud account — no firewall holes needed for job payloads, only the agent's outbound registration to EAS.

aws_integration.jil
JIL
/* Trigger AWS Glue ETL job from AutoSys — agent runs on EC2 with IAM role */
insert_job: JOB_AWS_GLUE_ETL   job_type: CMD
machine:       aws-agent-prod-01       /* EC2 instance with AutoSys agent */
owner:         svc_autosys
command:       aws glue start-job-run \
  --job-name etl-payment-settlements \
  --arguments '{"--run_date":"$$GVAR_RUN_DATE"}' \
  --region eu-west-1 \
  --query 'JobRunId' --output text
condition:      success(JOB_EXTRACT_TXN)
max_run_alarm:  30
box_name:       BOX_PAYMENT_EOD

/* Invoke AWS Lambda synchronously and gate on success */
insert_job: JOB_AWS_LAMBDA_VALIDATE   job_type: CMD
machine:       aws-agent-prod-01
owner:         svc_autosys
command:       aws lambda invoke \
  --function-name validate-settlement-data \
  --payload '{"date":"$$GVAR_RUN_DATE"}' \
  --cli-binary-format raw-in-base64-out \
  /tmp/lambda_response.json \
  && grep -q '"statusCode":200' /tmp/lambda_response.json
condition:      success(JOB_AWS_GLUE_ETL)
max_run_alarm:  10
box_name:       BOX_PAYMENT_EOD

azure_integration.jil
JIL
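/* Submit Azure Batch task from AutoSys; agent runs on Azure VM with Managed Identity.
   Caveat: task create returns as soon as the task is queued, so the exit-code check
   below only passes once the task has completed. Poll the task state before relying on it. */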
insert_job: JOB_AZURE_BATCH_RISK   job_type: CMD
machine:       azure-agent-prod-01     /* Azure VM with AutoSys agent */
owner:         svc_autosys
command:       az batch task create \
  --account-name batchprodaccount \
  --job-id risk-calc-job \
  --task-id risk-$$GVAR_RUN_DATE \
  --command-line "/bin/bash -c 'python /scripts/risk_calc.py --date $$GVAR_RUN_DATE'" \
  && az batch task show \
  --account-name batchprodaccount \
  --job-id risk-calc-job \
  --task-id risk-$$GVAR_RUN_DATE \
  --query 'executionInfo.exitCode' | grep -q '^0$'
condition:      success(JOB_LOAD_TXN)
max_run_alarm:  60
box_name:       BOX_PAYMENT_EOD
IAM and Managed Identity, not credentials in JIL. Never embed AWS access keys or Azure service principal secrets in command attributes — they end up in the AutoSys database in plaintext. The correct pattern is to assign an IAM role (AWS) or Managed Identity (Azure) directly to the EC2 / Azure VM running the AutoSys agent. The CLI tools (aws, az) automatically use instance credentials. For Kubernetes workloads, the pattern of creating a job from a CronJob template works well — use kubectl wait --for=condition=complete job/risk-calc-DATE --timeout=3600s in a subsequent CMD job to gate downstream AutoSys jobs on Kubernetes job completion.
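
Gating on a cloud task usually means polling until it reports a terminal state. The sketch below is one minimal way to do that from a wrapper script called by a CMD job. The function name poll_until and its argument order are assumptions made for illustration, not part of AutoSys or any cloud CLI.

```shell
# Run a command repeatedly until its output equals the expected value.
# Arguments: expected value, max attempts, seconds between attempts, command...
# Returns 0 when the value is seen, 1 when the attempts run out.
poll_until() (
  expected=$1; attempts=$2; interval=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    actual=$("$@" 2>/dev/null)
    if [ "$actual" = "$expected" ]; then
      exit 0
    fi
    i=$(( i + 1 ))
    # Sleep only if another attempt follows
    if [ "$i" -le "$attempts" ]; then
      sleep "$interval"
    fi
  done
  echo "gave up after $attempts attempts" >&2
  exit 1
)

# Hypothetical usage (task id and polling budget are placeholders):
# poll_until completed 120 30 az batch task show \
#   --account-name batchprodaccount --job-id risk-calc-job \
#   --task-id risk-20260331 --query state --output tsv
```

Keep the polling budget (attempts × interval) below the job's max_run_alarm so the alarm still fires if the task never completes.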

AI Job Types and Intelligent Automation (R26+)

Broadcom's AutoSys R26 roadmap introduces AI-native job types and AI-assisted operations as first-class platform features — a significant shift for workload automation. The headline capability is AI job remediation: when a job fails, the platform can analyse the failure pattern, match it against historical incidents, and suggest or automatically apply a recovery action (restart with modified parameters, skip and proceed, or page on-call with a pre-populated runbook). For shops running hundreds of nightly batch jobs, this reduces mean-time-to-recovery on common failure classes without requiring an on-call engineer to diagnose from scratch.

Two additional capabilities worth tracking for R26+: AI job type (dispatching workloads to LLM-backed services as a native job, not a wrapper script), and MCP-based orchestration that allows AutoSys to act as a workflow coordinator for multi-agent AI pipelines — where individual steps are AI model invocations rather than shell commands. These features are in active development as of R26 and will be most relevant for enterprises building ML pipelines on top of existing AutoSys infrastructure. If you're evaluating AutoSys for a new installation today, confirm whether your licensed version includes these capabilities before designing AI-integrated workflows around them.

💡
On existing AutoSys versions (r12.x / r21): AI integration today means a job calling an ML model API endpoint, either a CMD job wrapping an HTTP client or the Web Services job pattern above. The R26 AI job types formalise this into a first-class construct with built-in result parsing, retry logic, and model version pinning. The JIL patterns in earlier sections remain fully valid regardless of version.

The jil Command — Applying and Managing JIL

Job Information Language defines your jobs. The jil command is what loads them into the database. There are three ways to run it — and one of them (the syntax validator) is the one most people skip until they've had a bad day.

Applying JIL — Three Methods

applying_jil.sh
BASH
# Method 1: Redirect a JIL script file (most common in production)
jil < payment_eod.jil

# Method 2: Interactive mode — type JIL statements directly, Ctrl+D to commit
jil
# jil> insert_job: JOB_TEST  job_type: CMD
# jil> command: /opt/test.sh
# jil> machine: proc-01
# jil> ^D   (Ctrl+D commits to database)

# Method 3: Validate syntax WITHOUT committing to database
# Always run this before applying in production
jil -syntax < payment_eod.jil
# If valid: no output and exit code 0
# If invalid: error message with line number

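# The remaining examples are JIL subcommands, not shell commands.
# Put them in a .jil file or pipe them to jil, for example with a here-document.

# Update an existing job definition
jil <<'EOF'
update_job: JOB_EXTRACT_TXN
max_run_alarm: 90        /* change one attribute, rest unchanged */
EOF

# Delete a job
jil <<'EOF'
delete_job: JOB_OLD_EXTRACT
EOF

# Delete a box AND all jobs inside it
jil <<'EOF'
delete_box: BOX_OLD_PIPELINE
EOF

# Delete a global variable (done with sendevent, not JIL; delete_glob removes global blobs)
sendevent -E SET_GLOBAL -G "GVAR_DEPRECATED_FLAG=DELETE"

# Register a machine in AutoSys topology
jil <<'EOF'
insert_machine: new-etl-server-01
max_load:    10
factor:      1.00
opsys:       LINUX
description: "New ETL processing server"
EOF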
Always run jil -syntax < script.jil before applying to production. The syntax checker validates the entire script without touching the database. A single syntax error in a 200-job JIL file can abort the import partway through, leaving the definitions before the error applied and the rest missing. Validate first, always.
💡
update_job only changes the attributes you specify. All other attributes remain exactly as they were. This is the safe way to change a single attribute on a live job without risk of accidentally resetting other settings. Use insert_job only when creating a new job: it is rejected if a job with that name already exists, so resetting every attribute means deleting the job and inserting it again.
Production Incident — Partial JIL Import

Symptom: Only 40 of 60 jobs in a JIL script were created. The other 20 were missing with no error in the AutoSys logs.

Root cause: A syntax error on line 180 caused the jil command to abort mid-import. Jobs defined after line 180 were never loaded.

Fix: Ran jil -syntax < script.jil, found the error (a missing colon in a condition attribute), fixed it, re-ran the full import. Jobs already created by the partial import needed to be manually deleted first.

Prevention: Always validate with -syntax before importing. In CI/CD pipelines, add jil -syntax as a mandatory gate before any JIL deployment.
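
As a sketch of that gate, the wrapper below refuses to apply a JIL file unless the syntax check passes first. The function name deploy_jil and the JIL_CMD override are assumptions introduced for illustration; the -syntax option is used exactly as described in this guide.

```shell
# Apply a JIL file only if the syntax check passes.
# JIL_CMD lets a pipeline or a test substitute the jil binary; it defaults to "jil".
deploy_jil() (
  file=$1
  jil_cmd=${JIL_CMD:-jil}

  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    exit 2
  fi

  # Gate: validate first, nothing is written to the database here
  if ! "$jil_cmd" -syntax < "$file"; then
    echo "syntax check failed, nothing applied: $file" >&2
    exit 1
  fi

  # Only reached when validation succeeded
  "$jil_cmd" < "$file"
)

# Usage in a deployment step:
# deploy_jil payment_eod.jil || exit 1
```

Because the function returns non-zero on a failed check, a CI/CD stage that calls it stops before any partial import can happen.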

Virtual Machines and Load Balancing

AutoSys virtual machines have nothing to do with hypervisors. They're logical pools of real agent machines — you point a job at the pool and AutoSys picks which agent to actually run it on. If you're running dozens of batch jobs that can execute on any of several identical servers, virtual machines let you distribute that load automatically instead of hardcoding machine names into every job definition.

virtual_machine.jil
JIL
/* Define a virtual machine containing 3 real agents */
insert_machine: VM_ETL_POOL
type:           v           /* v = virtual machine */
machine_method: ROUNDROBIN  /* distribute jobs evenly */
real_machines:  etl-proc-01 etl-proc-02 etl-proc-03
description:    "ETL processing pool — 3 agents"

/* Job targets the virtual machine — AutoSys picks the real agent */
insert_job: JOB_PROCESS_BATCH   job_type: CMD
command:    /opt/etl/process.sh
machine:    VM_ETL_POOL     /* targets pool, not specific machine */
max_run_alarm: 30

Load Balancing Methods

Method | How It Works | Best For
ROUNDROBIN | Jobs distributed sequentially across all available agents | Uniform job sizes, simple pools
CPU_MON | Job sent to agent with lowest current CPU usage | Mixed workloads with variable CPU demand
JOB_LOAD | Uses job_load and max_load attributes to track theoretical load | Jobs with known resource weights
💡
ROUNDROBIN is the safest default for most enterprise pools. CPU_MON requires the rstatd daemon running on all target machines. If it's not running, AutoSys silently falls back to a different selection method, which may not distribute as expected. Confirm rstatd status before using CPU_MON in production.

Advanced Condition Syntax

The condition attribute has more options than most people use. The basics — success(JOB) and AND/OR combinations — cover 80% of pipelines. But the other 20% is where things get interesting: jobs that should only run if something is not running, conditions that check freshness with lookback windows, and exit codes that aren't as binary as they look.

advanced_conditions.jil
JIL
/* done() — runs if job is in ANY completed state (SUCCESS, FAILURE, TERMINATED)
   Use when you want to proceed regardless of how the upstream job ended */
condition: done(JOB_OPTIONAL_CLEANUP)

/* notrunning() — runs only if the specified job is NOT currently executing
   Use to prevent two jobs running on the same resource simultaneously */
condition: notrunning(JOB_DB_MAINTENANCE)

/* failure() with lookback — trigger if upstream failed recently
   Useful for alerting jobs that should only fire on fresh failures */
condition: failure(JOB_PAYMENT_FEED, 01.00)

/* Lookback using colon syntax — must escape colon with backslash
   Both formats are valid: 01.30 and 01\:30 */
condition: success(JOB_RISK_ENGINE, 01\:30)

/* Combination — complex real-world condition */
condition: success(JOB_EXTRACT, 02.00) AND
            notrunning(JOB_DB_BACKUP) AND
            value(GVAR_MARKET_OPEN) = "Y"
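
To make the lookback arithmetic concrete, here is a minimal sketch that converts an HH.MM lookback into seconds and checks whether a past success still falls inside the window. The function names and the use of epoch seconds are assumptions for illustration. AutoSys evaluates lookbacks internally, and this does not reproduce its implementation or handle the escaped-colon form.

```shell
# Convert an HH.MM lookback value (e.g. 01.30) to seconds.
# One leading zero is stripped so the shell does not read "08" as octal.
lookback_seconds() (
  hh=${1%%.*}
  mm=${1#*.}
  hh=${hh#0}
  mm=${mm#0}
  echo $(( ${hh:-0} * 3600 + ${mm:-0} * 60 ))
)

# Succeed (exit 0) if the last success happened within the lookback window.
# Arguments: last-success epoch seconds, current epoch seconds, lookback HH.MM
within_lookback() (
  last=$1
  now=$2
  window=$(lookback_seconds "$3")
  [ $(( now - last )) -le "$window" ]
)

# A success 50 minutes ago satisfies a 01.00 lookback but not a 00.30 one
if within_lookback 1000 4000 01.00; then echo "01.00: satisfied"; fi
if ! within_lookback 1000 4000 00.30; then echo "00.30: stale"; fi
```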

max_exit_success — Treating Non-Zero Exit Codes as Success

By default AutoSys marks a job FAILURE if it exits with any non-zero code. The max_exit_success attribute lets you define a threshold — any exit code up to and including that value is treated as SUCCESS. Critical for scripts that use exit codes to signal warnings rather than failures.

max_exit_success.jil
JIL
insert_job: JOB_DATA_VALIDATION   job_type: CMD
command:         /opt/validate/run_checks.sh
machine:         etl-proc-01
max_exit_success: 4
/* Exit codes 0-4 treated as SUCCESS
   Exit code 0 = all checks passed
   Exit codes 1-4 = warnings (some checks failed but acceptable)
   Exit code 5+ = FAILURE (critical errors) */
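
The threshold rule can be sketched in a few lines of shell. classify_exit is a hypothetical helper written for this illustration; it mirrors the rule described above and is not an AutoSys command.

```shell
# Classify an exit code against a max_exit_success threshold.
# Arguments: exit code, threshold (defaults to 0, so only exit 0 succeeds)
classify_exit() (
  code=$1
  threshold=${2:-0}
  if [ "$code" -le "$threshold" ]; then
    echo SUCCESS
  else
    echo FAILURE
  fi
)

# With max_exit_success: 4, codes 0 and 4 pass and 5 fails
for code in 0 4 5; do
  echo "exit $code -> $(classify_exit "$code" 4)"
done
```

A helper like this is handy in a script's own test suite, to check that its warning codes stay at or below the threshold set in the job definition.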
💡
ON_ICE affects lookback evaluation. Per official docs: if a predecessor job is in ON_ICE status, any lookback condition on it always evaluates to true — the lookback window is ignored. This means if you ice a job to skip it, downstream jobs with lookback conditions will proceed as if it succeeded, regardless of the lookback window.

Resources — Concurrency Control

Without resource controls, AutoSys Workload Automation dispatches jobs as fast as conditions allow. That's fine until four load jobs hit the same Oracle database simultaneously and everything grinds to a halt. Resources are named concurrency counters — you define how many units exist, jobs declare how many they consume, and AutoSys queues the rest until capacity frees up.

resources.jil
JIL
/* Define a renewable resource: max 3 concurrent DB connections */
insert_resource: DB_CONNECTIONS
res_type: R                     /* renewable: units are returned to the pool after use */
amount: 3
description: "Max concurrent Oracle DW connections"

/* Jobs that consume this resource, each consuming 1 unit */
insert_job: JOB_LOAD_PAYMENTS   job_type: CMD
command:    /opt/etl/load_payments.sh
machine:    etl-proc-01
resources:  (DB_CONNECTIONS, QUANTITY=1)      /* consumes 1 unit */

insert_job: JOB_LOAD_TRADES   job_type: CMD
command:    /opt/etl/load_trades.sh
machine:    etl-proc-02
resources:  (DB_CONNECTIONS, QUANTITY=1)      /* waits if all 3 units are in use */

/* A heavy job consuming multiple units */
insert_job: JOB_BULK_LOAD   job_type: CMD
command:    /opt/etl/bulk_load.sh
machine:    etl-proc-01
resources:  (DB_CONNECTIONS, QUANTITY=2)      /* consumes 2 units, counts as 2 connections */
Resources are the correct way to prevent database overload, not artificial conditions or sleep commands. When a job needs more resource units than are currently free, it waits in RESWAIT status until capacity is released. Use autorep -V DB_CONNECTIONS to check current resource utilisation.
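
The counting behaviour can be modelled in a few lines. The sketch below is a simplified illustration, not AutoSys's scheduler: jobs arrive in order, each asks for a number of units, and units are never released during the run. The function name simulate_resource is an assumption.

```shell
# Model a resource counter: a job starts if enough units are free,
# otherwise it waits. Arguments: capacity, then JOB:UNITS pairs in arrival order.
simulate_resource() (
  free=$1
  shift
  for req in "$@"; do
    job=${req%%:*}
    units=${req##*:}
    if [ "$units" -le "$free" ]; then
      free=$(( free - units ))
      echo "$job starts (free units left: $free)"
    else
      echo "$job waits (needs $units, free: $free)"
    fi
  done
)

simulate_resource 3 JOB_BULK_LOAD:2 JOB_LOAD_PAYMENTS:1 JOB_LOAD_TRADES:1
# JOB_BULK_LOAD starts (free units left: 1)
# JOB_LOAD_PAYMENTS starts (free units left: 0)
# JOB_LOAD_TRADES waits (needs 1, free: 0)
```

The heavy job taking two units is what pushes the third job into the queue, which is the point of weighting jobs by the load they put on the database.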

FTP Job Type

Most teams handle file transfers with wrapper shell scripts that call sftp or ftp — and then spend time debugging script quoting issues, missing error handling, and logs that don't tell you what actually failed. The FTP job type (job_type: FTP) replaces all of that. AutoSys Workload Automation manages the connection, authentication, transfer, and exit code natively.

ftp_job.jil
JIL
/* FTP job — download file from remote server */
insert_job: JOB_FTP_GET_PAYMENT   job_type: FTP
box_name:       BOX_PAYMENT_EOD
ftp_machine:    sftp.partner-bank.com
ftp_user:       svc_transfer
ftp_password:   %%ENCRYPTED_PASSWORD%%
ftp_src_file:   /outgoing/payment_$$GVAR_RUN_DATE.csv
ftp_dest_file:  /data/incoming/payment_$$GVAR_RUN_DATE.csv
ftp_dest_dir:   /data/incoming
machine:        file-drop-01   /* agent that performs the transfer */
description:    "Download daily payment file from partner bank"
max_run_alarm:  15

/* FTP job — upload results to remote server */
insert_job: JOB_FTP_PUT_REPORT   job_type: FTP
box_name:       BOX_PAYMENT_EOD
ftp_machine:    reporting.internal
ftp_user:       svc_reports
ftp_src_file:   /data/reports/eod_$$GVAR_RUN_DATE.csv
ftp_dest_dir:   /reports/incoming
machine:        file-drop-01
condition:      success(JOB_GENERATE_REPORT)
💡
Never hardcode FTP passwords in plain text JIL. Use AutoSys credential management or encrypted password references. Plain text passwords in JIL files are a security audit finding in every enterprise. The %%ENCRYPTED_PASSWORD%% pattern uses the AutoSys credential vault.

Calendars — Advanced Scheduling

If you've ever had a month-end job silently not run because nobody thought about what happens when the last business day falls on a weekend — you needed calendars. Calendars are named sets of dates that jobs and boxes reference for scheduling. Instead of hardcoding days_of_week and hoping it covers edge cases, you define the dates centrally and reference them across as many jobs as needed.

Defining Calendars in JIL

calendars.jil
JIL
/* Standard calendar — specific dates the job SHOULD run */
insert_calendar: CAL_TRADING_DAYS_2026
datetimes:  01/02/2026 01/05/2026 01/06/2026 01/07/2026
            01/08/2026 01/09/2026  /* add all trading days */
description: "NYSE trading days 2026"

/* Extended calendar — calculates dates by rule
   last_business_day: runs on last business day of each month */
insert_calendar: CAL_MONTH_END
type:        extended
definition:  "last_business_day"
description: "Last business day of each month"

/* Exception calendar — dates the job should NOT run
   Use with run_calendar to exclude holidays */
insert_calendar: CAL_HOLIDAYS_2026
datetimes:  01/01/2026 05/25/2026 07/04/2026 12/25/2026

/* Apply calendar to a box job */
insert_job: BOX_TRADING_EOD   job_type: BOX
run_calendar:  CAL_TRADING_DAYS_2026
exclude_calendar: CAL_HOLIDAYS_2026
start_times:  "18:00"

Checking and Diagnosing Calendars

calendar_commands.sh
BASH
# List all dates in a calendar — critical for month-end debugging
autorep -c CAL_TRADING_DAYS_2026 -t

# List all defined calendars
autorep -c %

# Check next scheduled run dates for a box
autorep -J BOX_TRADING_EOD -d | grep -i calendar

# Forecast when a job will next run (shows upcoming scheduled dates)
forecast -J BOX_TRADING_EOD -t 30
Production Incident — Missing Calendar Dates

Symptom: Month-end reporting jobs didn't run in March. No failures — jobs show INACTIVE the entire day.

Root cause: The calendar CAL_BUSINESS_DAYS_2026 was built from a template that didn't account for March 31 being a Tuesday (valid business day). A data entry error left it off the datetimes list.

Fix: autorep -c CAL_BUSINESS_DAYS_2026 -t confirmed March 31 was missing. Added it with update_calendar JIL, then force-started the box with sendevent -E FORCE_STARTJOB -J BOX_EOM_REPORTS -D 20260331.

Prevention: After building any annual calendar, run autorep -c CALENDAR_NAME -t and manually verify the count of dates matches expected business days for the year.
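
The count check can be scripted rather than done by eye. The sketch below computes the number of Monday-to-Friday days in a year with integer arithmetic and counts the distinct dates in a calendar listing. Both function names are assumptions, and the date pattern assumes the listing prints dates as MM/DD/YYYY, as in the JIL above; adjust it to whatever your autorep output actually looks like.

```shell
# Count Monday-to-Friday days in a year using integer arithmetic only.
count_weekdays() (
  year=$1
  days=365
  if [ $(( year % 4 )) -eq 0 ] && [ $(( year % 100 )) -ne 0 ] || [ $(( year % 400 )) -eq 0 ]; then
    days=366
  fi
  # Weekday of 1 January, 0 = Sunday (Gauss's formula)
  y=$(( year - 1 ))
  dow=$(( (1 + 5 * (y % 4) + 4 * (y % 100) + 6 * (y % 400)) % 7 ))
  # Each of the 52 full weeks contributes five weekdays
  count=$(( days / 7 * 5 ))
  # The one or two leftover days fall on the same weekdays as 1 and 2 January
  i=0
  while [ "$i" -lt $(( days % 7 )) ]; do
    d=$(( (dow + i) % 7 ))
    if [ "$d" -ne 0 ] && [ "$d" -ne 6 ]; then
      count=$(( count + 1 ))
    fi
    i=$(( i + 1 ))
  done
  echo "$count"
)

# Count distinct MM/DD/YYYY dates in a calendar listing read from stdin.
count_calendar_dates() {
  grep -oE '[0-9]{2}/[0-9]{2}/[0-9]{4}' | sort -u | wc -l | tr -d ' '
}

count_weekdays 2026   # prints 261
# Hypothetical comparison against a saved listing:
# count_calendar_dates < cal_business_days_2026.txt
```

Expected business days are the weekday count minus the holidays that fall on weekdays. If the calendar's date count differs from that figure, a date is missing or duplicated.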

job_depends — Validating Condition References

Rename a job. Delete a job. Move a job between boxes. Any of those changes can silently break conditions in jobs that reference the old name — and you won't know until those downstream jobs sit in ACTIVATED forever with no error, no alarm, no indication that anything is wrong. In AutoSys Workload Automation, a condition referencing a non-existent job evaluates silently to false. job_depends is the command that catches this before it catches you.

job_depends_commands.sh
BASH
# Check if all condition references in a job are valid
# Reports any job names referenced in conditions that don't exist in AutoSys
job_depends -J JOB_LOAD_TXN

# Check an entire box and all its children
job_depends -J BOX_PAYMENT_EOD

# Check all jobs in the system — run this after any large JIL import
job_depends -J %

# Example output when a dependency is broken:
# JOB_LOAD_TXN: condition job JOB_TRANSFORM_V2_TXN not found in database
# This means JOB_LOAD_TXN will never start — its condition always false
Run job_depends -J % after every JIL deployment. When jobs are renamed, deleted, or migrated between environments, condition references can become dangling pointers. job_depends catches all of them in one pass. A job with a broken condition will silently never run — there is no error, no alarm, just an ACTIVATED job that waits forever.
Production Incident — Renamed Job Broke 12 Conditions

Symptom: After a JIL refactor renaming JOB_EXTRACT to JOB_EXTRACT_TXN, 12 downstream jobs stopped running. They showed ACTIVATED indefinitely.

Root cause: The 12 jobs had condition: success(JOB_EXTRACT). The old name no longer existed. Conditions silently evaluated to false.

Fix: job_depends -J % immediately identified all 12 broken references. Updated all conditions to reference JOB_EXTRACT_TXN.

Prevention: Make job_depends -J % part of your deployment runbook. Run it after every JIL change that renames or deletes jobs.
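
A complementary check can run before deployment, on the JIL file itself. The sketch below lists job names that appear inside condition functions but are not defined by any insert_job in the same file. The function name find_dangling_refs is an assumption, and the check is deliberately naive: jobs that already exist in the database but are not in the file show up as false positives, so treat the output as a review list rather than a verdict.

```shell
# List job names referenced in conditions of a JIL file that are not
# defined by an insert_job line in the same file.
find_dangling_refs() (
  file=$1
  awk '
    # Record every job defined in this file: "insert_job: NAME ..."
    /^[ \t]*insert_job:/ {
      sub(/^[ \t]*insert_job:[ \t]*/, "")
      defined[$1] = 1
      next
    }
    # Collect names inside success(), failure(), done(), notrunning(), terminated()
    {
      line = $0
      while (match(line, /(success|failure|done|notrunning|terminated)\([A-Za-z0-9_.#-]+/)) {
        ref = substr(line, RSTART, RLENGTH)
        sub(/^[a-z]+\(/, "", ref)
        referenced[ref] = 1
        line = substr(line, RSTART + RLENGTH)
      }
    }
    END {
      for (ref in referenced)
        if (!(ref in defined)) print ref
    }
  ' "$file" | sort
)

# Usage before a deployment:
# find_dangling_refs payment_eod.jil
```

Run against a refactored file, this would have flagged JOB_EXTRACT in the incident above before the import reached the database.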

Key Takeaways
01. AutoSys is event-driven, not just time-driven. Jobs run when conditions are met, not just at scheduled times.
02. ON_HOLD pauses a job but blocks downstream. ON_ICE skips a job and lets downstream proceed. Know which you need before acting.
03. CHANGE_STATUS is your most important recovery tool. Use it when work was done manually and you need the pipeline to continue without re-running.
04. Always set max_run_alarm on long-running jobs. Without it, a hung job runs silently forever and the box never completes.
05. STARTING stuck for more than 5 minutes usually means the remote agent is down or unreachable. Check the agent before investigating the job.
06. Global variables persist across box runs. Always reset critical variables at pipeline start to avoid processing stale data from previous runs.
07. Use EEM roles (Operator, Developer, Admin) mapped to LDAP groups. On-call engineers need Operator access only; never hand out Admin for day-to-day support.
08. In Dual Event Server HA setups, use a virtual hostname for agent connections. This makes failover transparent to all remote agents.
09. Web Services job type (WS) calls REST APIs natively, with no wrapper script needed. Container workloads run via CMD jobs invoking docker or kubectl on an agent machine.
10. autorep -J JOB_NAME -f gives you machine, exit code, and timestamps. Always use -f first when diagnosing a failure, not just -s.

AutoSys Version History — What Changed in R21, R24, and R26

AutoSys version numbers matter in production. The version your organisation runs determines which JIL attributes are available, how security is enforced, and whether features like containerised agents or AI job types are in scope. This section tracks the key changes across recent major releases — useful when upgrading, auditing a legacy environment, or comparing capabilities with a prospective employer's stack.

Release | Key Changes | Impact
r12.x | Baseline for most legacy enterprise installs. EEM introduced for RBAC. Oracle / MSSQL backend. Web UI (WAAE). JIL-based job management. WS and FTP job types available. | Still running in many banks and telcos. Most JIL patterns in this guide apply fully.
r11.3 / r11.3.6 | Widely deployed version in financial services. Stable but approaching end of support in many regions. | Check Broadcom support lifecycle before planning new deployments on this version.
r21 | Modernised Web UI. Improved REST API surface. Enhanced container job support. Updated agent communication protocols. | First version where container and REST job types are production-grade for most use cases.
r24.1 | Secure Agent Communication (HTTPS): agents register and communicate over TLS by default. Enhanced Security hardening across EEM integration. UI modernisation continued. Improved telemetry controls. | If your organisation has strict data-in-transit requirements, r24.1 is the minimum version to target for new deployments. Existing r12.x environments upgrading to r24.1 need to plan agent certificate rollout.
r26 (roadmap) | AI job remediation: failure pattern matching with automated recovery suggestions. AI job type: LLM-backed service invocation as a native job construct. MCP orchestration: AutoSys as coordinator for multi-agent AI workflows. Further UI and API modernisation. | Not yet GA at time of writing. Evaluate for environments running ML pipelines or looking to reduce on-call toil from batch failure diagnosis.

Upgrade Considerations: r12.x → r24.1

The most common upgrade path in enterprise shops is moving off r12.x toward r21 or r24.1. The key compatibility considerations:

upgrade_checklist.sh
BASH
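# 1. Audit existing JIL for deprecated attributes before upgrade
#    ('deprecated_attr|old_param' is a placeholder: substitute the attribute
#    names listed as deprecated in the release notes for your target version)
autorep -J % -q | grep -E 'deprecated_attr|old_param'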

# 2. Export all job definitions to JIL files for backup and diff
autorep -J % -q > all_jobs_backup.jil

# 3. Validate exported JIL against new version syntax checker
#    (run on a test instance running the target version)
jil -syntax < all_jobs_backup.jil

# 4. Check EEM policy compatibility — role definitions may need migration
#    Review EEM role mappings before cutover

# 5. Plan agent certificate rollout for r24.1 HTTPS comms
#    Each remote agent needs a signed cert or trust of the EAS CA
autoping -m ALL  # Verify all agents reachable before upgrade
💡
On version terminology: the product has been documented as "CA Workload Automation AE", "Workload Automation AE", and "AutoSys Workload Automation", depending on the release and on whether the material predates Broadcom's acquisition of CA. Automic Automation is a separate Broadcom scheduler, not a rebrand of AutoSys. When talking to vendors or searching documentation, "AutoSys r24", "CA WA AE r24", and "Broadcom WA AE r24.1" all refer to the same product. The names are marketing; the JIL syntax and CLI tools (sendevent, autorep, jil) are consistent across all of them.

Production JIL Templates and Troubleshooting Checklist

The patterns below are production-tested templates. Copy, rename the job and box identifiers, and adjust the machine and owner attributes for your environment. Every template includes the attributes most commonly omitted in first drafts — the ones that cause 3am pages.

Minimal Production CMD Job

cmd_production_minimal.jil
JIL
/* Production CMD job — minimum viable attributes for a reliable batch job */
insert_job: JOB_YOUR_NAME         job_type: CMD
box_name:        BOX_YOUR_BOX
machine:         your-agent-hostname
owner:           svc_autosys          /* service account, never a personal account */
command:         /opt/scripts/your_script.sh $$GVAR_RUN_DATE
std_out_file:    /logs/autosys/JOB_YOUR_NAME_$$DATE.out
std_err_file:    /logs/autosys/JOB_YOUR_NAME_$$DATE.err
alarm_if_fail:   1
max_run_alarm:   30                   /* alert if running more than 30 mins */
condition:       success(JOB_UPSTREAM)
timezone:        GMT                  /* always set explicitly — never rely on agent default */
description:     "What this job does and who owns it — ops@yourcompany.com"

File Watcher with Age Check

file_watcher_with_age.jil
JIL
/* File Watcher with size and steady-state guards: prevents triggering on empty or still-growing files */
insert_job: FW_SETTLEMENT_FILE      job_type: FW
box_name:        BOX_SETTLEMENT_EOD
machine:         file-agent-prod-01
owner:           svc_autosys
watch_file:      /data/incoming/settlement_$$DATE.csv
watch_file_min_size: 1              /* reject empty files */
watch_interval:  60                 /* check every 60 seconds; size must be unchanged between checks */
alarm_if_fail:   1
max_run_alarm:   480                 /* alert if file doesn't arrive within 8 hours */
timezone:        GMT

BOX with Calendar and Dependency Chain

box_calendar_driven.jil
JIL
/* Calendar-driven BOX — month-end safe pattern */
insert_job: BOX_MONTH_END_CLOSE     job_type: BOX
owner:           svc_autosys
run_calendar:    LAST_BUSINESS_DAY    /* named calendar, not days_of_week */
start_times:     "20:00"
timezone:        GMT
alarm_if_fail:   1
box_failure:     1                    /* box goes FAILURE if any child fails */
description:     "Month-end close pipeline — finance-ops@yourcompany.com"

/* Child job pattern: always reference the box, never schedule children independently */
insert_job: JOB_EXTRACT_GL          job_type: CMD
box_name:        BOX_MONTH_END_CLOSE
machine:         batch-agent-prod-01
owner:           svc_autosys
command:         /opt/scripts/extract_gl.sh $$GVAR_RUN_DATE
std_out_file:    /logs/autosys/JOB_EXTRACT_GL_$$DATE.out
std_err_file:    /logs/autosys/JOB_EXTRACT_GL_$$DATE.err
alarm_if_fail:   1
max_run_alarm:   45

Production Troubleshooting Checklist

When an AutoSys job fails in production, work through this checklist before touching anything. Acting before diagnosing is the fastest path to a longer outage.

# | Check | Command | What to look for
1 | What is the current job status? | autorep -J JOB_NAME -s | Confirm FAILURE vs TERMINATED vs INACTIVE: they mean different things
2 | What was the exit code? | autorep -J JOB_NAME -d | Exit code 127 = command not found. 126 = permission denied. 1 = script error. 0 = success
3 | Did the job actually start on the agent? | Check std_err_file on the agent machine | Empty err file = job never started. Non-empty = script ran but failed
4 | Is the agent reachable? | autoping -m MACHINE_NAME | If agent is unreachable, job will hang in STARTING indefinitely
5 | Are upstream dependencies satisfied? | autorep -J UPSTREAM_JOB -s | Job may be waiting for a condition that will never be true this run
6 | Is this a timing issue? | autorep -J JOB_NAME -q, then look for run_window | Job may have missed its run window and won't start until next cycle
7 | Has this failed before? | Check job history in WCC or autorep -J JOB_NAME -r -1 (previous run) | Pattern of failures at same time = environmental issue, not code bug
8 | What changed recently? | Check jil audit log in AutoSys database | Job definition changes, machine changes, calendar updates
9 | Is the box in a bad state? | autorep -J BOX_NAME -s | RUNNING box with child in FAILURE won't auto-recover; may need FORCE_STARTJOB on box
10 | Is this a global variable issue? | autorep -G GVAR_NAME | Empty or wrong value will cause condition-dependent jobs to never start
The rule that saves the most time: before sending any sendevent, run autorep -J JOB_NAME -q and read the full job definition output. Most production incidents trace back to a condition attribute, a wrong machine name, or a missing global variable that takes 30 seconds to spot in the definition but 90 minutes to diagnose by guessing.
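
The exit codes in row 2 of the checklist can be turned into a first hypothesis automatically. explain_exit_code is a hypothetical helper written for this illustration. The branch for codes above 128 follows the common shell convention of 128 plus the signal number, which is an assumption about how the failing script was run rather than something AutoSys guarantees.

```shell
# Map a job exit code to a first hypothesis to investigate.
explain_exit_code() (
  code=$1
  case "$code" in
    0)   echo "success" ;;
    126) echo "permission denied or file not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ]; then
        # Shell convention: 128 + signal number
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "script error, read std_err_file"
      fi ;;
  esac
)

explain_exit_code 127   # prints: command not found
explain_exit_code 137   # prints: terminated by signal 9
```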

Interview Questions

These come up repeatedly in AutoSys and workload automation interviews at investment banks, telcos, and enterprise tech shops. They cover scheduling logic, automation patterns, and production recovery — and they test whether you've actually been on-call, actually debugged a stuck pipeline, actually had to decide between FORCE_STARTJOB and CHANGE_STATUS under pressure. Knowing the theory is table stakes. The answers here go further.

Q What is the difference between ON_HOLD and ON_ICE in AutoSys?
ON_HOLD prevents a job from running but it remains visible to the scheduler — downstream jobs still wait for it to complete before proceeding. ON_ICE removes the job from scheduler consideration entirely — downstream jobs treat it as if it already succeeded and proceed without waiting. Use ON_HOLD when you want to delay execution and resume later. Use ON_ICE when you want to permanently skip a job for this cycle without blocking the pipeline.
Q A box job is stuck in RUNNING but all child jobs show SUCCESS. What happened and how do you fix it?
The most common cause is that the box itself has a condition attribute referencing a job outside the box that hasn't completed. Run autorep -J BOX_NAME -q to inspect the box's own definition, including its condition. Also check if any child job is in a state other than SUCCESS, since autorep sometimes truncates output. If the box condition references an external job, check that job's status and either fix it or force its status to SUCCESS. If all conditions are genuinely met, sendevent -E FORCE_STARTJOB -J BOX_NAME restarts the box, but that re-runs its children; when the work is already done, sendevent -E CHANGE_STATUS -s SUCCESS -J BOX_NAME closes the box without re-running anything.
Q When would you use CHANGE_STATUS instead of FORCE_STARTJOB?
Use CHANGE_STATUS when the work represented by the job has already been completed outside of AutoSys — for example, a DBA manually ran the SQL that the job would have executed, or a file transfer was done manually. In this case you don't want the job to re-run (which could cause duplicates or conflicts), you just want AutoSys to know it's done so downstream jobs can proceed. FORCE_STARTJOB actually executes the job again, which you only want when the job genuinely needs to re-run.
Q How do you restart an entire failed pipeline from a specific step without re-running steps that already succeeded?
You cannot directly restart from a specific step, but you can achieve the same result. First, use CHANGE_STATUS to mark all jobs that already succeeded as SUCCESS (if they're not already). Then FORCE_STARTJOB on the specific job that failed. AutoSys will re-evaluate conditions and since the earlier jobs show SUCCESS, the failed job's conditions are met and it will run. Any jobs downstream with condition: success(FAILED_JOB) will automatically queue once the re-run succeeds.
Q What does n_retrys do and what are its limitations?
n_retrys tells AutoSys to automatically restart a failed job up to N times before marking it as permanently FAILURE. Combined with retry_interval (minutes between retries), it handles transient failures like network timeouts. The key limitation: retries only apply to FAILURE exit codes, not to TERMINATED status (jobs killed via KILLJOB). Also, if the job fails during a box run and the box completes (because other jobs don't depend on this one), retries stop — the box ending resets all job states.
Q What is the difference between FORCE_STARTJOB and STARTJOB?
STARTJOB starts a job only if its conditions are met — it respects the job's condition attribute and will not start if conditions are unsatisfied. FORCE_STARTJOB bypasses all conditions, time windows, and date restrictions and starts the job immediately regardless of its state. Use STARTJOB to trigger a job within its normal logic. Use FORCE_STARTJOB for emergency recovery when you need to override everything.
Q A job is stuck in STARTING for 15 minutes. Walk through your exact diagnosis steps.
Step 1: Run autoping -m MACHINE_NAME — if it fails, the agent is unreachable. Fix the agent first (restart cybagent service). Step 2: If autoping succeeds, run autorep -M MACHINE_NAME -r to confirm the machine alias is registered correctly in AutoSys topology. Step 3: SSH to the target machine and verify the owner account exists: id svc_autosys. If the user doesn't exist, the job stays in STARTING indefinitely with no error. Step 4: Check the owner has execute permission on the script path. Step 5: Check EAS logs for dispatch errors.
Q What is a lookback condition and when would you use one?
A lookback condition restricts a dependency to a success within a specific time window — for example, success(JOB_MARKET_FEED, 00.30) means the job only satisfies the condition if it succeeded within the last 30 minutes. Without a lookback, a job that succeeded hours or even days ago still satisfies the condition. Lookbacks are essential for time-sensitive pipelines — market data feeds, regulatory cutoffs, real-time settlement — where stale data from a previous run completing the condition would be dangerous.
Q How do global variables work in AutoSys and what is their biggest production risk?
Global variables are key-value pairs stored in the AutoSys database, accessible to any job on any machine using the $$VARNAME syntax in commands or value(VARNAME) in conditions. They're set via sendevent -E SET_GLOBAL or defined in JIL with insert_global. The biggest production risk is persistence — global variables retain their value across box runs. If last night's run set GVAR_PAYMENT_FILE to a specific path and tonight's file watcher fails silently, the variable still holds yesterday's path and downstream jobs will process stale data. Always reset critical variables at the start of each pipeline run.
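One way to reset a critical variable at the start of a run is a small job at the head of the box. This is a sketch under assumptions: the job name, box name, machine alias, and the PENDING sentinel value are all hypothetical conventions, not anything AutoSys prescribes.

```shell
# Write a JIL definition for a reset job that overwrites last night's value.
cat > reset_gvar.jil <<'EOF'
/* First job in the box: replace the previous value with a sentinel. */
insert_job: JOB_RESET_PAYMENT_GVAR
job_type: CMD
box_name: BOX_PAYMENTS
machine: prod_batch_01
owner: svc_autosys
command: sendevent -E SET_GLOBAL -G "GVAR_PAYMENT_FILE=PENDING"
EOF

# Review the file, then load it on the AutoSys host:
#   jil < reset_gvar.jil
```

Downstream jobs can then treat the sentinel as "not ready" instead of silently reusing yesterday's path.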
Q What does the machine attribute in JIL reference — the hostname, IP, or something else?
The machine attribute references the alias registered in the AutoSys topology database, not the server's actual hostname or IP address. These can be identical, but they don't have to be. You can verify registered machine aliases with autorep -M % -r. Using the wrong value — the actual hostname when the topology alias is different — leaves the job stuck in STARTING indefinitely because the Event Server cannot resolve the target agent. Always confirm the topology alias before defining a new job for a new machine.
Q What is the difference between alarm_if_fail and alarm_if_terminated?
alarm_if_fail triggers an alarm when a job exits with a non-zero exit code (FAILURE status). alarm_if_terminated triggers an alarm when a job ends in TERMINATED status, either killed via KILLJOB or terminated when term_run_time expires. Note that max_run_alarm only raises an alarm when a job overruns; it does not kill the job. In production, set both attributes on critical jobs: a job that hangs and gets killed by term_run_time will not trigger alarm_if_fail because its status is TERMINATED, not FAILURE. Without alarm_if_terminated, a silently hung-and-killed job can go unnoticed.
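A sketch of the alarm attributes on a critical job, written as a JIL update. The job name JOB_SETTLEMENT and the minute values are hypothetical. Here max_run_alarm raises an alarm after 45 minutes of runtime, and term_run_time terminates the job after 60.

```shell
# Write a JIL update that alarms on both FAILURE and TERMINATED.
cat > settlement_alarms.jil <<'EOF'
update_job: JOB_SETTLEMENT
alarm_if_fail: 1
alarm_if_terminated: 1
max_run_alarm: 45
term_run_time: 60
EOF

# Review the file, then load it on the AutoSys host:
#   jil < settlement_alarms.jil
```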
Q How does AutoSys High Availability work, and what is split-brain?
AutoSys Workload Automation HA uses a Primary/Shadow configuration sharing the same database. (Broadcom's documentation calls the two processes the primary and shadow schedulers and reserves the term Event Server for the database they share.) The Shadow monitors the Primary's heartbeat; if the heartbeat stops, the Shadow promotes itself to Primary and resumes scheduling within ~60 seconds. Split-brain occurs when a network partition causes the Shadow to lose the heartbeat temporarily even though the Primary is still alive. The Shadow promotes itself while the original Primary keeps running, so both dispatch the same jobs, causing duplicate runs. Prevention: dedicated heartbeat network interface, a longer promotion timeout on the Shadow, and a database-level fencing lock to prevent dual-active.
Q What is the permission attribute in JIL and what are the risks of using wx or we?
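The permission attribute controls who can execute (x) and edit (e) a job. The prefixes are g (group), w (world), and m (all machines: mx and me extend execute and edit rights to authorized users logged in on any machine, not only the one where the job was defined). The owner needs no prefix, because the owner always has edit and execute rights on the machine where the job was defined. wx means any user in the system can force-start the job; we means any user can modify the JIL definition. In enterprise environments, wx and we are serious security risks: any developer or operator account could accidentally or maliciously modify or trigger a production settlement job. Best practice is gx,ge only, mapping the group to a controlled AD/LDAP group via EEM.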
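A minimal sketch of tightening an existing job to group-only access. The job name JOB_SETTLEMENT is hypothetical.

```shell
# Write a JIL update that grants execute and edit to the owner's group only.
cat > settlement_perms.jil <<'EOF'
update_job: JOB_SETTLEMENT
permission: gx,ge
EOF

# Review the file, then load it on the AutoSys host:
#   jil < settlement_perms.jil
```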
Q How would you use AutoSys to call a REST API as part of a pipeline?
Use the Web Services job type (job_type: WS), which is native in AutoSys r12.x+. Define web_svc_url, web_svc_method (GET/POST), web_svc_body for the request payload, and web_svc_success_codes for the HTTP codes that constitute success (typically 200,201,202). The job succeeds or fails based on the response code, so no wrapper script is needed. Set term_run_time so a hung connection is killed rather than left running; max_run_alarm on its own only raises an alarm. Global variables can be interpolated into the URL or body using $$GVAR_NAME syntax.
Q A calendar-driven box didn't run on the last business day of the month. What are the likely causes?
The most common cause is a calendar definition that doesn't account for month-end on a weekend. If run_calendar: BUSINESS_DAYS is used and the last day of the month falls on Saturday, the box has no trigger date for that run. Other causes: the calendar was updated and the change wasn't applied to all environments; the box has a days_of_week restriction that conflicts with the calendar; or the Event Server was down during the scheduled window and the missed run wasn't caught up. Diagnose with autorep -c CALENDAR_NAME -t to see all scheduled dates.
Q What is the difference between std_out_file and the job log in AutoSys?
std_out_file captures the script's stdout on the remote agent machine — whatever the shell script prints to standard output. The job log in the AutoSys database captures the job's lifecycle events: when it was dispatched, which agent received it, the exit code, start and end timestamps. Both are essential for debugging: the job log tells you what AutoSys did, the std_out_file tells you what the script did. If std_out_file is not set, stdout is lost when the job completes. Use $DATE in the filename to preserve one log per run rather than overwriting.
Q How do you export an existing job definition back to JIL format?
Use autorep -J JOB_NAME -q — the -q flag outputs the job definition in JIL format that can be piped directly to the jil command on another instance. This is the standard way to copy jobs between environments (dev → test → prod), create backups before making changes, or document existing job definitions. For a full box including all children: autorep -J BOX_NAME -q exports the box and every job inside it.
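A small wrapper makes the backup-before-change habit harder to skip. This is a sketch, assuming a POSIX shell with autorep on the PATH; the function name backup_job_def and the dated file naming are hypothetical choices.

```shell
# Export a job or box definition to a dated JIL file before changing it.
backup_job_def() {
    job="$1"
    out="${job}.$(date +%Y%m%d).jil"

    # -q prints the definition as JIL; for a box it includes the child jobs.
    # Treat a failed command or an empty export as an error.
    if ! autorep -J "$job" -q > "$out" || [ ! -s "$out" ]; then
        echo "export failed for $job" >&2
        rm -f "$out"
        return 1
    fi

    echo "$out"    # print the backup file name for the caller
}
```

To promote the definition, load the resulting file on the target instance with `jil < FILE`.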
Q What happens to running jobs when the Primary Event Server fails over to the Shadow?
Jobs already dispatched and running on remote agents continue executing — agents run independently and don't require a live Event Server connection to complete a job. The agent writes the exit code and status back to the database when the job completes, and the new Primary Event Server picks up those results on its next database poll. Jobs that were in STARTING or ACTIVATED state at the moment of failover may need to be force-started manually — the transition can cause them to be skipped if the new Primary doesn't see their dispatch record.
Q How do you check why a global variable condition is preventing a job from starting?
Run autorep -G GVAR_NAME to see the current value of the variable. If it's empty or set to an unexpected value, that's why the condition fails. Also run autorep -J JOB_NAME -q to print the job definition and confirm exactly what value the condition attribute is checking (-d shows the run's event detail, not the definition). Common scenario: a job has condition: value(GVAR_RUN_DATE) != "" but the variable was never set because an earlier job that calls SET_GLOBAL failed. Fix by setting it manually with sendevent -E SET_GLOBAL -G "GVAR_RUN_DATE=20260417" (the name and value go together in one -G argument), then force-starting the blocked job.
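sendevent takes the variable as a single name=value argument to -G, which is easy to mistype under pressure. A minimal sketch of a guard around it, assuming a POSIX shell; the function name set_global is hypothetical.

```shell
# Set a global variable, refusing empty names or values.
set_global() {
    name="$1"
    value="$2"

    if [ -z "$name" ] || [ -z "$value" ]; then
        echo "usage: set_global NAME VALUE" >&2
        return 1
    fi

    # Name and value are passed together as one -G argument.
    sendevent -E SET_GLOBAL -G "${name}=${value}"
}

# Typical recovery sequence on the AutoSys host:
#   autorep -G GVAR_RUN_DATE
#   set_global GVAR_RUN_DATE 20260417
#   sendevent -E FORCE_STARTJOB -J JOB_NAME
```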
Q What is the purpose of watch_file_min_size in a File Watcher job and what problem does it solve?
watch_file_min_size (often referred to loosely as min_file_size) sets the minimum size in bytes that the watched file must reach before the FW job considers it a success. Setting it to 1 prevents the job from triggering on an empty file, a common failure mode in file-based pipelines where the upstream system creates the file immediately but writes data to it over time. Without it, the FW job can succeed on a file that exists but still holds 0 bytes; the downstream CMD job then starts and attempts to process an empty file. This causes subtle failures that are hard to diagnose because the file exists but contains no data. Pair it with watch_interval, which makes the watcher wait until the file size has stopped changing between checks.
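In JIL the attribute is spelled watch_file_min_size. A minimal File Watcher sketch follows; the job name, machine alias, owner, and file path are hypothetical. watch_interval is in seconds.

```shell
# Write a JIL definition for a File Watcher that ignores empty files.
cat > fw_payment.jil <<'EOF'
insert_job: FW_PAYMENT_FILE
job_type: FW
machine: prod_batch_01
owner: svc_autosys
watch_file: /data/inbound/payments.csv
watch_file_min_size: 1
watch_interval: 60
EOF

# Review the file, then load it on the AutoSys host:
#   jil < fw_payment.jil
```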
Q How do you force-run a box for a specific historical date that was missed?
Use the -D flag with sendevent: sendevent -E FORCE_STARTJOB -J BOX_NAME -D YYYYMMDD. This triggers the box as if it were running on the specified date, which is critical for date-aware jobs that use $$DATE or date-based global variables. Without the -D flag, FORCE_STARTJOB runs the box with today's date, which would cause the jobs to process the wrong data set. Always confirm the date format matches your AutoSys configuration — some environments use MMDDYYYY.
Q What is the EEM role model in AutoSys and how does it differ from JIL permissions?
JIL permissions (permission: gx,ge) are per-job access controls defined in the job definition itself — they control who can execute or edit that specific job based on OS group membership. EEM (Embedded Entitlements Manager) is the centralized RBAC layer that maps roles (Operator, Developer, Admin) to LDAP/AD groups across all jobs in the instance. EEM supersedes JIL permissions in modern AutoSys deployments — it allows consistent access control without touching individual job definitions. Use EEM when you need enterprise-wide role enforcement; use JIL permissions as a secondary layer for job-specific restrictions.
Naren Founder & Author

20 years in enterprise IT, the last decade working with AutoSys deployments in banking, insurance, and fintech environments — the kind of shops running 800-job nightly batch windows where a single misconfigured condition: attribute at midnight becomes a 3am incident call. The production incidents, gotchas, and debugging patterns in this guide are drawn from those environments, not from documentation.

I built TheCodeForge because I was tired of documentation that explains what to type without explaining why it works — and what breaks when it doesn't.
