Definitive Guide · TheCodeForge

The Complete
AutoSys Guide
for Engineers

Verified in production environments · Enterprise-grade examples

Everything you need to master CA AutoSys Workload Automation — from JIL syntax to production incident recovery. Written for engineers who actually run these systems at scale.

19 Core Sections · 40+ Commands & Examples · 15 Debug Patterns · 20+ Interview Qs

What AutoSys Actually Is

AutoSys (now officially CA Workload Automation AE, owned by Broadcom) is an enterprise job scheduling system that runs batch jobs, file watchers, and complex multi-step workflows across distributed systems. It is the backbone of nightly processing in investment banks, insurance companies, telcos, and healthcare systems — the infrastructure that runs payroll, bank settlements, ETL pipelines, and regulatory reporting at 2am every night.

Plain English
"Think of AutoSys as a highly opinionated cron — but one where jobs know about each other, dependencies are enforced across servers, and when something breaks at 3am, the on-call engineer gets paged with the exact job name, exit code, and failure reason."

The key difference from cron is that AutoSys jobs are event-driven. A job doesn't just run at a time — it runs when conditions are met: another job succeeded, a file appeared in a directory, a global variable changed, or an external trigger fired. This makes it the coordination layer for complex enterprise pipelines where Job B must wait for Job A, and Job C must wait for both.
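In JIL terms (the syntax is covered in full below), that fan-in is a single condition attribute. A minimal sketch with illustrative job names:

```jil
/* JOB_C starts only after both upstream jobs succeed —
   names and paths here are illustrative, not from a real pipeline */
insert_job: JOB_C   job_type: CMD
command:    /opt/app/step_c.sh
machine:    proc-01
condition:  success(JOB_A) AND success(JOB_B)
```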

AutoSys vs Cron vs Control-M

Feature | Cron | AutoSys | Control-M
Job dependencies | None | Full condition logic | Full condition logic
Cross-server jobs | Per-server only | Remote agents | Remote agents
File watchers | No | Native FW job type | Native
Alerting | Manual setup | Built-in alarms | Built-in
Restart/recovery | Manual | sendevent commands | GUI + commands
Audit trail | Logs only | Full event history in DB | Full event history
Typical use | Simple scheduled tasks | Enterprise batch pipelines | Enterprise batch pipelines

Architecture Deep Dive

AutoSys has three main components: the Event Server (EAS), Remote Agents, and the AutoSys database. Understanding how they interact is essential for diagnosing failures and designing reliable pipelines.

AutoSys Component Architecture
sendevent CLI
WAAE Web UI
External App Trigger
↓ events & commands
Event Server (EAS) — schedules, evaluates conditions, dispatches
↓ dispatches jobs via TCP
Remote Agent (Linux)
Remote Agent (Windows)
Remote Agent (AIX/HP-UX)
↓ job results and status written to
AutoSys Database (Oracle or MSSQL)
Event Server reads/writes all job state to the DB. Agents are stateless — they execute and report back.

The Event Server

The Event Server is the brain. It reads events from the database queue, evaluates job conditions, and dispatches ready jobs to remote agents. It runs as a daemon process (EAS) and must be running for any job scheduling to occur. If the Event Server goes down, jobs queue but do not run.

Remote Agents

Agents are lightweight daemons installed on target machines. They receive job dispatch instructions from the Event Server, execute the command in the job's owner context, capture exit codes and stdout/stderr, and report status back. A machine without an agent cannot run AutoSys jobs.

Production Incident — Agent Connection Failure

Symptom: Jobs targeting a specific machine are stuck in STARTING state for 10+ minutes, then fail with "connection refused."

Root cause: The Remote Agent daemon on the target machine crashed after an OS patch restart. The Event Server couldn't reach it.

Fix: SSH to the target machine, check agent status with ps -ef | grep cybAgent, restart with service cybagent restart, then force-start the affected jobs.

Prevention: Add agent health monitoring to your infrastructure alerting. A dead agent is silent — it doesn't alert AutoSys directly.
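A minimal probe for that monitoring can live in plain shell. This is a sketch: the process name cybAgent comes from the incident above, and the alerting hook is site-specific and left as a comment.

```shell
#!/bin/sh
# check_agent: succeed if a cybAgent process appears in the given `ps -ef` output.
# The [c] trick stops grep from matching its own command line.
check_agent() {
  echo "$1" | grep -q '[c]ybAgent'
}

if check_agent "$(ps -ef)"; then
  echo "cybAgent running"
else
  echo "cybAgent missing - restart with: service cybagent restart"
  # hook your paging/alerting system here
fi
```

Run it from cron or your monitoring agent on every machine that hosts a Remote Agent.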

JIL Syntax — The Complete Reference

JIL (Job Information Language) is the DSL used to define every object in AutoSys — jobs, boxes, file watchers, global variables, and machine definitions. Every AutoSys engineer needs to read and write JIL fluently.

BOX Job — Pipeline Container

payment_eod.jil
JIL
/* BOX — container for the entire EOD pipeline */
insert_job: BOX_PAYMENT_EOD   job_type: BOX
owner:            svc_autosys
permission:       gx,ge,wx,we,mx,me
date_conditions:  1
days_of_week:     mo,tu,we,th,fr
start_times:      "22:00"
timezone:         GMT
description:      "End-of-day payment settlement pipeline"
alarm_if_fail:    1
alarm_if_terminated: 1

CMD Job — Shell Command Execution

extract_transform_load.jil
JIL
/* Step 1: Extract — no condition, runs when box starts */
insert_job: JOB_EXTRACT_TXN   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/etl/extract_transactions.sh
machine:    db-extract-01
owner:      svc_autosys
std_out_file: /logs/autosys/extract_txn.out
std_err_file: /logs/autosys/extract_txn.err
max_run_alarm: 60

/* Step 2: Transform — depends on extract success */
insert_job: JOB_TRANSFORM_TXN   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/etl/transform.sh
machine:    etl-proc-01
owner:      svc_autosys
condition:  success(JOB_EXTRACT_TXN)
max_run_alarm: 45

/* Step 3: Load — depends on transform, alerts on failure */
insert_job: JOB_LOAD_TXN   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/etl/load_to_dw.sh
machine:    dw-loader-01
owner:      svc_autosys
condition:  success(JOB_TRANSFORM_TXN)
alarm_if_fail: 1
max_run_alarm: 30
n_retrys:   2
retry_interval: 5
Always set max_run_alarm on load jobs: a hung database connection can keep a CMD job in RUNNING state indefinitely, and without the alarm the box never completes and nothing fires. 30 minutes is a reasonable default for most ETL loads. If you want AutoSys to kill the runaway job rather than just raise an alarm, set term_run_time as well.

File Watcher Job

file_watcher.jil
JIL
/* File watcher — triggers when file arrives */
insert_job: FW_PAYMENT_FILE   job_type: FW
box_name:        BOX_PAYMENT_EOD
watch_file:      /data/incoming/payment_*.csv
watch_interval:  60
owner:           svc_autosys
machine:         file-drop-01
min_file_size:   1

/* Process job runs only after file arrives */
insert_job: JOB_PROCESS_PAYMENT   job_type: CMD
box_name:   BOX_PAYMENT_EOD
command:    /opt/payment/process.sh
machine:    proc-01
condition:  success(FW_PAYMENT_FILE)

Job Status Codes Explained

Every AutoSys job cycles through a series of status codes. Knowing what each one means — and what caused it — is the core skill for on-call AutoSys support.

Status | Meaning | Common Cause
INACTIVE | Job exists but has never run in this cycle | Normal initial state
ACTIVATED | Box started, job is waiting for its conditions | Normal, waiting for dependencies
STARTING | Event Server dispatching to remote agent | Normal, should be brief (<30s)
RUNNING | Executing on remote agent | Normal during execution
SUCCESS | Completed with exit code 0 | Normal completion
FAILURE | Completed with non-zero exit code | Script error, bad data, permissions
TERMINATED | Killed via sendevent KILLJOB | Manual intervention or term_run_time exceeded
ON_HOLD | Held by operator, will not run | Manual hold, maintenance window
ON_ICE | Frozen, invisible to scheduler | Manual skip, downstream proceeds
RESTART | Waiting to be restarted after failure | n_retrys configured, retry pending
💡
STARTING stuck for >5 minutes almost always means the remote agent is down or unreachable. Check the agent first before investigating the job itself.
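For runbooks, the table above collapses into a small lookup. This is a plain-shell helper sketch; the suggested actions mirror the "Common Cause" column and are starting points, not a complete playbook.

```shell
# triage_hint: first diagnostic step for a given AutoSys status code.
triage_hint() {
  case "$1" in
    STARTING)        echo "check remote agent health first" ;;
    FAILURE)         echo "check std_err_file and the exit code" ;;
    TERMINATED)      echo "check for KILLJOB events and term_run_time" ;;
    ON_HOLD|ON_ICE)  echo "check for operator holds" ;;
    RESTART)         echo "retry pending - watch n_retrys" ;;
    *)               echo "normal - no action" ;;
  esac
}
```

Usage: `triage_hint STARTING` prints the first thing to check before you touch the job itself.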

sendevent — The On-Call Toolkit

When something breaks in production, sendevent is your primary intervention tool. Every AutoSys engineer needs these commands available at 3am without Googling.

Command | What It Does
sendevent -E FORCE_STARTJOB -J <job> | Force start immediately. Bypasses all conditions and time windows. Use when a dependency already completed but status wasn't captured.
sendevent -E KILLJOB -J <job> | Kill a running or hung job. Sends SIGTERM to the process on the remote agent. Job status becomes TERMINATED.
sendevent -E CHANGE_STATUS -J <job> -s SUCCESS | Manually mark as SUCCESS. Critical recovery tool. Use when work was completed manually and you need the pipeline to proceed.
sendevent -E JOB_ON_HOLD -J <job> | Put job on hold. Job won't run until released. Downstream jobs wait. Persists across box restarts.
sendevent -E JOB_ON_ICE -J <job> | Freeze job completely. Job is invisible to the scheduler. Downstream jobs behave as if it already succeeded.
sendevent -E JOB_OFF_HOLD -J <job> | Release job from hold. Job re-enters scheduling. If conditions are met, it starts on the next evaluation cycle.
sendevent -E JOB_OFF_ICE -J <job> | Unfreeze job. Job re-enters scheduler consideration. Does not automatically trigger a run.
sendevent -E STARTJOB -J <box> | Start a box manually, outside its normal schedule. All child jobs run per their conditions within the box.
Production Incident — Wrong Recovery Tool

Scenario: JOB_LOAD_TXN failed at 23:45 due to a tablespace issue. The DBA fixed the tablespace and manually inserted the data directly. The EOD box is stuck waiting for the load job to succeed.

Wrong fix: sendevent -E FORCE_STARTJOB -J JOB_LOAD_TXN — this re-runs the load script, which will attempt to insert the data again and create duplicates.

Correct fix: sendevent -E CHANGE_STATUS -J JOB_LOAD_TXN -s SUCCESS — marks the job as succeeded without running it, allowing downstream jobs to proceed. No duplicate data.

autorep — Monitoring & Reporting

autorep is the read-only companion to sendevent. It queries the AutoSys database and reports on job status, history, and definitions. Master these commands for rapid diagnosis.

autorep_commands.sh
BASH
# Show current status of a specific job
autorep -J JOB_LOAD_TXN -s

# Show all jobs in a box with their current status
autorep -J BOX_PAYMENT_EOD -s

# Show detailed report for a job (events, start/end times, exit codes)
autorep -J JOB_LOAD_TXN -d

# Show a previous run (run number relative to current; -1 = one run back)
autorep -J JOB_LOAD_TXN -r -1

# Show all jobs whose most recent run is FAILURE
autorep -J % -s | grep FAILURE

# Show all jobs on a specific machine
autorep -M db-extract-01 -s

# Show all currently RUNNING jobs
autorep -J % -s | grep RUNNING

# Wide-format report with full job names and complete timestamps
autorep -J BOX_PAYMENT_EOD -s -w

# Export job definition back to JIL format
autorep -J BOX_PAYMENT_EOD -q
🔍
autorep -J % -s | grep FAILURE is your first command every morning during on-call. The % wildcard matches all jobs. Pipe to grep to filter by status. Add | wc -l to get a count.
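The morning triage can be wrapped into one helper. A sketch, assuming statuses appear spelled out in your autorep output as in the greps above; pipe `autorep -J % -s` into it on a machine with the AutoSys client.

```shell
# status_counts: tally FAILURE/TERMINATED/RUNNING occurrences in autorep
# output read from stdin, printing "count status" per line.
status_counts() {
  grep -oE 'FAILURE|TERMINATED|RUNNING' | sort | uniq -c
}
# morning triage: autorep -J % -s | status_counts
```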

Production Debug Guide

Systematic recovery paths for the most common AutoSys production failures. Each pattern includes the symptom, diagnosis steps, and the correct fix.

Pattern 1: Box stuck in RUNNING, all child jobs SUCCESS

The box has a condition dependency on a job outside the box that hasn't completed, or a child job was added after the box started and never ran. Check with autorep -J BOX_NAME -d to see the box's own conditions, then check all child job statuses.

Pattern 2: Job in STARTING for more than 5 minutes

The remote agent on the target machine is down, unreachable, or the port is blocked. SSH to the target machine and check: ps -ef | grep cybAgent. If the process is missing, restart the agent. If the process is running, check firewall rules between the Event Server and the agent machine.

Pattern 3: Jobs not starting despite conditions being met

The Event Server may be down or the evaluation cycle is delayed. Check the scheduler process: ps -ef | grep event_demon. Also check the AutoSys database connection: if the EAS can't reach the DB, it queues internally but doesn't dispatch. Look at $AUTOUSER/out/event_demon.$AUTOSERV for the scheduler's error log.

Production Incident — Calendar Mismatch

Symptom: End-of-month jobs didn't run. No failures, just no execution. Jobs show INACTIVE.

Root cause: The box had run_calendar: BUSINESS_DAYS and the last day of the month fell on a Saturday. The calendar had no entry for that date, so the box never activated.

Fix: Updated the calendar definition to include the last business day of each month using extended_calendar attributes. Manually force-started the box for the missed date with sendevent -E FORCE_STARTJOB -J BOX_EOM_REPORTS.

Lesson: Always test calendar logic with autorep -c CALENDAR_NAME -t before deploying month-end or quarter-end jobs.

Global Variables

Global variables are key-value pairs stored in the AutoSys database that any job on any machine can read or write at runtime. They are the primary mechanism for passing data between jobs in a pipeline — file names, run dates, record counts, flags — without hardcoding values into scripts.

Plain English
"Global variables are AutoSys's shared whiteboard. Any job can write a value on it, and any downstream job can read it. The extract job writes today's run date. The load job reads it. No hardcoding, no file passing."

Defining and Setting Global Variables

global_variables.jil
JIL
/* Define a global variable in JIL */
insert_global: GVAR_RUN_DATE
value: "20260417"

insert_global: GVAR_PAYMENT_FILE
value: ""

/* Reference a global variable in a job command */
insert_job: JOB_EXTRACT_TXN   job_type: CMD
command: /opt/etl/extract.sh $$GVAR_RUN_DATE
machine: db-extract-01

/* Use global variable in a condition */
insert_job: JOB_LOAD_TXN   job_type: CMD
command: /opt/etl/load.sh $$GVAR_PAYMENT_FILE
condition: success(JOB_EXTRACT_TXN) AND value(GVAR_PAYMENT_FILE) != ""

Setting Global Variables at Runtime

set_global_at_runtime.sh
BASH
# Set a global variable from within a running job script
sendevent -E SET_GLOBAL -G "GVAR_PAYMENT_FILE=/data/incoming/payment_20260417.csv"

# Set from command line (operator intervention)
sendevent -E SET_GLOBAL -G "GVAR_RUN_DATE=20260417"

# Read current value
autorep -G GVAR_RUN_DATE

# List all global variables
autorep -G %
Global variables in conditions use the value(GVAR_NAME) syntax. This lets you gate job execution on data state, not just job completion state — a powerful pattern for file-driven pipelines where you need to verify a file path was actually set before attempting to process it.
Production Gotcha — Global Variable Persistence

Problem: Global variables persist across box runs. If last night's run set GVAR_PAYMENT_FILE to a stale path and tonight's file watcher failed silently, the load job may process yesterday's file again.

Fix: Always reset critical global variables to empty at the start of each box run using a dedicated reset job as the first step, before any file watchers or extract jobs run.
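That reset job can be a one-liner generator. A sketch, assuming the `-G "NAME=value"` form of SET_GLOBAL; the variable names are the examples from this section.

```shell
# reset_globals: emit one SET_GLOBAL event per variable, blanking each value.
# Pipe the output to sh on a machine with the AutoSys client installed.
reset_globals() {
  for g in "$@"; do
    printf 'sendevent -E SET_GLOBAL -G "%s="\n' "$g"
  done
}
# usage: reset_globals GVAR_PAYMENT_FILE GVAR_RUN_DATE | sh
```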

Security Model — RBAC and EEM

AutoSys security controls who can view, modify, run, and force-start jobs. In enterprise environments this is critical — you do not want a developer accidentally force-starting a production EOD pipeline. The security model has two layers: JIL permissions (legacy, per-job) and EEM (CA Embedded Entitlements Manager) (modern, role-based).

JIL Permission Attributes

Every job definition includes a permission attribute that controls access at the job level. Permissions are set as a comma-separated list of access codes.

job_permissions.jil
JIL
insert_job: JOB_PAYMENT_EOD   job_type: CMD
owner:       svc_autosys
permission:  gx,ge,wx,we,mx,me

/* Permission codes:
   g = group, w = world, m = me (owner)
   x = execute (run/force-start)
   e = edit (modify JIL definition)
   r = read (view job definition)

   gx,ge = group can execute and edit
   wx,we = world can execute and edit
   mx,me = owner can execute and edit

   Production best practice: gx,ge only
   Never wx or we in production — too permissive */

EEM Role-Based Access Control

EEM is the modern RBAC layer. It allows administrators to define roles (operator, developer, admin, read-only) and map them to LDAP/AD groups. This means access is managed centrally via your directory service rather than per-job JIL attributes.

Role | Typical Permissions | Use Case
Read-Only | View job status and history only | Business analysts, audit
Operator | Force-start, hold, ice, change status | On-call support engineers
Developer | Insert/update job definitions in dev/test | Application developers
Admin | Full access including delete and security changes | AutoSys team only
💡
Principle of least privilege applies here. On-call engineers need Operator access, not Admin. Developers need Developer access in non-production only. Audit every Admin account quarterly — in most enterprises, there are always stale accounts with Admin access that were never revoked.

High Availability — Dual Event Server

A single Event Server is a single point of failure. If it goes down, all job scheduling stops. For 24/7 enterprise environments, AutoSys supports a Dual Event Server (Primary/Shadow) configuration that provides automatic failover.

How Primary/Shadow Works

Two Event Server instances run simultaneously — a Primary that actively schedules and dispatches, and a Shadow that stays in sync by reading the same database. If the Primary fails, the Shadow detects the absence of heartbeat and promotes itself to Primary automatically, typically within 30-60 seconds.

Dual Event Server — HA Architecture
Primary EAS (Active)
Shadow EAS (Standby)
↓ both read/write ↓
AutoSys Database (shared)
↓ Primary dispatches to
Remote Agents
On Primary failure → Shadow detects missing heartbeat → promotes to Primary → resumes scheduling within ~60s

Key Configuration Points

  • Both servers must have identical AutoSys software versions and patch levels
  • The database must be on shared/clustered storage accessible to both servers
  • Network latency between Primary and Shadow should be under 10ms — same datacenter preferred
  • Remote agents connect to a virtual hostname or load balancer VIP, not the physical server name — this survives failover transparently
  • Monitor heartbeat with autoping -m EAS_MACHINE_NAME — include this in your monitoring stack
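The heartbeat bullet above can be scripted. A sketch: the checker command is injectable so the loop logic is testable anywhere; in production you would pass "autoping -m", and the EAS host names here are placeholders.

```shell
# check_eas_hosts: probe each Event Server host with the given checker command
# and print one status line per host. checker is run as: $checker <host>
check_eas_hosts() {
  checker=$1; shift
  for m in "$@"; do
    if $checker "$m" >/dev/null 2>&1; then
      echo "$m OK"
    else
      echo "$m DOWN"
    fi
  done
}
# usage: check_eas_hosts "autoping -m" EAS_PRIMARY EAS_SHADOW
```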
Production Incident — Split Brain

Symptom: Jobs ran twice. Duplicate records in the data warehouse.

Root cause: Network partition between Primary and Shadow. Shadow couldn't see Primary's heartbeat, promoted itself. Network recovered — now both servers thought they were Primary and dispatched the same jobs.

Fix: Immediately shut down one EAS instance. Identify duplicated records via run history timestamps. Implement a fencing mechanism (STONITH or database-based lock) to prevent dual-active scenarios.

Prevention: Use a dedicated heartbeat network interface separate from the data network. Configure the Shadow with a longer promotion timeout to survive brief network blips.

Modern Integrations — REST API, Cloud & Containers

AutoSys r12.x and later supports integrations well beyond shell scripts on bare-metal Linux. In 2026 enterprise environments, AutoSys pipelines routinely trigger REST APIs, run containerised workloads, and dispatch jobs to cloud agents.

REST API / Web Services Jobs

The Web Services job type (job_type: WS) allows AutoSys to call REST or SOAP endpoints directly as a job — no wrapper script needed. The job succeeds or fails based on the HTTP response code.

rest_api_job.jil
JIL
/* Web Services job — calls a REST endpoint */
insert_job: JOB_TRIGGER_RISK_ENGINE   job_type: WS
box_name:       BOX_PAYMENT_EOD
web_svc_url:    https://risk-engine.internal/api/v2/run-eod
web_svc_method: POST
web_svc_body:   {"run_date":"$$GVAR_RUN_DATE","mode":"full"}
web_svc_success_codes: 200,201,202
condition:      success(JOB_LOAD_TXN)
max_run_alarm:  15

Container Jobs

AutoSys supports running Docker containers as jobs via the DOCKER job type or via CMD jobs that invoke docker run or kubectl. The container job type manages the container lifecycle — pull, run, capture exit code, clean up.

container_job.jil
JIL
/* CMD job wrapping a Docker container run */
insert_job: JOB_RISK_CALC_CONTAINER   job_type: CMD
command: docker run --rm \
  -e RUN_DATE=$$GVAR_RUN_DATE \
  -v /data/risk:/data \
  registry.internal/risk-calculator:2.1.4 \
  --mode full-eod
machine:       docker-host-01
owner:         svc_autosys
max_run_alarm: 45

/* Kubernetes job via kubectl */
insert_job: JOB_RISK_CALC_K8S   job_type: CMD
command: kubectl create job risk-calc-$$GVAR_RUN_DATE \
  --from=cronjob/risk-calculator \
  --namespace=batch-jobs
machine:       k8s-bastion-01
max_run_alarm: 60

Cloud Agents

AutoSys cloud agents run on AWS EC2, Azure VMs, or GCP instances and register back to the on-premises Event Server. From AutoSys's perspective they are just another machine attribute — the job definition is identical. Cloud agents enable hybrid pipelines where on-premises extract jobs feed cloud-based transformation and ML workloads.

For Kubernetes workloads, the pattern of creating a job from a CronJob template works well — it inherits all resource limits, image pull secrets, and service account bindings from the template. Use kubectl wait --for=condition=complete job/risk-calc-DATE --timeout=3600s in a subsequent CMD job to gate downstream AutoSys jobs on Kubernetes job completion.
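One way to wire that gate in is a dedicated wait job between the Kubernetes trigger and the downstream pipeline. A JIL sketch reusing the job names and namespace from the examples above; treat the attribute values as a starting point, not tested config.

```jil
/* gate downstream AutoSys jobs on the Kubernetes job completing */
insert_job: JOB_RISK_CALC_WAIT   job_type: CMD
command: kubectl wait --for=condition=complete \
  job/risk-calc-$$GVAR_RUN_DATE --namespace=batch-jobs --timeout=3600s
machine:       k8s-bastion-01
owner:         svc_autosys
condition:     success(JOB_RISK_CALC_K8S)
max_run_alarm: 65
```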

The jil Command — Applying and Managing JIL

JIL defines your jobs, but the jil command is what loads those definitions into the AutoSys database. Every engineer needs to know the three ways to run it and how to safely validate before committing.

Applying JIL — Three Methods

applying_jil.sh
BASH
# Method 1: Redirect a JIL script file (most common in production)
jil < payment_eod.jil

# Method 2: Interactive mode — type JIL statements directly, Ctrl+D to commit
jil
# jil> insert_job: JOB_TEST  job_type: CMD
# jil> command: /opt/test.sh
# jil> machine: proc-01
# jil> ^D   (Ctrl+D commits to database)

# Method 3: Validate syntax WITHOUT committing to database
# Always run this before applying in production
jil -syntax < payment_eod.jil
# If valid: no output and exit code 0
# If invalid: error message with line number

# Update an existing job definition (feed JIL via a here-doc)
jil << 'EOF'
update_job: JOB_EXTRACT_TXN
max_run_alarm: 90        /* change one attribute, rest unchanged */
EOF

# Delete a job
echo "delete_job: JOB_OLD_EXTRACT" | jil

# Delete a box AND all jobs inside it
echo "delete_box: BOX_OLD_PIPELINE" | jil

# Delete a global variable
echo "delete_glob: GVAR_DEPRECATED_FLAG" | jil

# Register a machine in the AutoSys topology
jil << 'EOF'
insert_machine: new-etl-server-01
max_load:    10
factor:      1.00
opsys:       LINUX
description: "New ETL processing server"
EOF
Always run jil -syntax < script.jil before applying to production. The syntax checker validates the entire script without touching the database. A single syntax error in a 200-job JIL file will abort the entire import — leaving you with a partially applied definition. Validate first, always.
💡
update_job only changes the attributes you specify. All other attributes remain exactly as they were. This is the safe way to change a single attribute on a live job without risk of accidentally resetting other settings. Use insert_job only when creating a new job or when you deliberately want to reset all attributes.
Production Incident — Partial JIL Import

Symptom: Only 40 of 60 jobs in a JIL script were created. The other 20 were missing with no error in the AutoSys logs.

Root cause: A syntax error on line 180 caused the jil command to abort mid-import. Jobs defined after line 180 were never loaded.

Fix: Ran jil -syntax < script.jil, found the error (a missing colon in a condition attribute), fixed it, re-ran the full import. Jobs already created by the partial import needed to be manually deleted first.

Prevention: Always validate with -syntax before importing. In CI/CD pipelines, add jil -syntax as a mandatory gate before any JIL deployment.
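A CI gate for this is a short loop. In this sketch the checker command is injectable so the gate logic itself runs anywhere; on a build agent with the AutoSys client you would pass "jil -syntax", and the directory layout is an assumption.

```shell
# validate_jil_dir: syntax-check every .jil file in a directory before deployment.
# checker stands in for the real "jil -syntax" client invocation.
validate_jil_dir() {
  dir=$1; checker=$2
  for f in "$dir"/*.jil; do
    [ -e "$f" ] || continue          # directory may contain no .jil files
    if ! $checker < "$f"; then
      echo "syntax error in $f" >&2
      return 1
    fi
  done
  echo "all JIL valid"
}
# CI usage: validate_jil_dir ./jil "jil -syntax" || exit 1
```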

Virtual Machines and Load Balancing

AutoSys virtual machines are not VMs in the hypervisor sense — they are logical groups of real agent machines. When a job targets a virtual machine, AutoSys automatically selects which real machine to dispatch it to based on a configurable load-balancing method. This is essential for high-throughput batch environments where hundreds of jobs need to be distributed across a pool of agents.

virtual_machine.jil
JIL
/* Define a virtual machine containing 3 real agents */
insert_machine: VM_ETL_POOL
type:           v           /* v = virtual machine */
machine_method: ROUNDROBIN  /* distribute jobs evenly */
machine:        etl-proc-01   /* member agents, one attribute per machine */
machine:        etl-proc-02
machine:        etl-proc-03
description:    "ETL processing pool — 3 agents"

/* Job targets the virtual machine — AutoSys picks the real agent */
insert_job: JOB_PROCESS_BATCH   job_type: CMD
command:    /opt/etl/process.sh
machine:    VM_ETL_POOL     /* targets pool, not specific machine */
max_run_alarm: 30

Load Balancing Methods

Method | How It Works | Best For
ROUNDROBIN | Jobs distributed sequentially across all available agents | Uniform job sizes, simple pools
CPU_MON | Job sent to agent with lowest current CPU usage | Mixed workloads with variable CPU demand
JOB_LOAD | Uses job_load and max_load attributes to track theoretical load | Jobs with known resource weights
💡
ROUNDROBIN is the safest default for most enterprise pools. CPU_MON requires the rstatd daemon running on all target machines; if it's not running, CPU_MON has no load data to work from and jobs may not distribute as expected. Confirm rstatd status before using CPU_MON in production.

Advanced Condition Syntax

The condition attribute supports more status types and patterns than most engineers know. These are the ones that appear in production and in interviews.

advanced_conditions.jil
JIL
/* done() — runs if job is in ANY completed state (SUCCESS, FAILURE, TERMINATED)
   Use when you want to proceed regardless of how the upstream job ended */
condition: done(JOB_OPTIONAL_CLEANUP)

/* notrunning() — runs only if the specified job is NOT currently executing
   Use to prevent two jobs running on the same resource simultaneously */
condition: notrunning(JOB_DB_MAINTENANCE)

/* failure() with lookback — trigger if upstream failed recently
   Useful for alerting jobs that should only fire on fresh failures */
condition: failure(JOB_PAYMENT_FEED, 01.00)

/* Lookback using colon syntax — must escape colon with backslash
   Both formats are valid: 01.30 and 01\:30 */
condition: success(JOB_RISK_ENGINE, 01\:30)

/* Combination — complex real-world condition */
condition: success(JOB_EXTRACT, 02.00) AND
            notrunning(JOB_DB_BACKUP) AND
            value(GVAR_MARKET_OPEN) = "Y"

max_exit_success — Treating Non-Zero Exit Codes as Success

By default AutoSys marks a job FAILURE if it exits with any non-zero code. The max_exit_success attribute lets you define a threshold — any exit code up to and including that value is treated as SUCCESS. Critical for scripts that use exit codes to signal warnings rather than failures.

max_exit_success.jil
JIL
insert_job: JOB_DATA_VALIDATION   job_type: CMD
command:         /opt/validate/run_checks.sh
machine:         etl-proc-01
max_exit_success: 4
/* Exit codes 0-4 treated as SUCCESS
   Exit code 0 = all checks passed
   Exit codes 1-4 = warnings (some checks failed but acceptable)
   Exit code 5+ = FAILURE (critical errors) */
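The exit-code contract above can be sketched as a script skeleton. A minimal sketch using a function with return codes in place of a script's exit codes; the argument names and thresholds are illustrative, matching max_exit_success: 4.

```shell
# run_checks: validation body compatible with max_exit_success: 4.
# Return 0 = clean, 1-4 = warning count (still SUCCESS), 5 = critical failure.
run_checks() {
  warnings=$1   # non-critical check failures
  critical=$2   # critical check failures
  if [ "$critical" -gt 0 ]; then
    return 5           # above max_exit_success: AutoSys marks FAILURE
  fi
  return "$warnings"   # 0-4: AutoSys marks SUCCESS despite warnings
}
# in the real script, use `exit` with the same codes
```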
💡
ON_ICE affects lookback evaluation. Per official docs: if a predecessor job is in ON_ICE status, any lookback condition on it always evaluates to true — the lookback window is ignored. This means if you ice a job to skip it, downstream jobs with lookback conditions will proceed as if it succeeded, regardless of the lookback window.

Resources — Concurrency Control

Resources are named counters that limit how many jobs can run simultaneously against a shared asset — a database, a file system, an API endpoint. Without resources, AutoSys will dispatch as many concurrent jobs as conditions allow, which can overwhelm downstream systems.

resources.jil
JIL
/* Define a resource — max 3 concurrent DB connections */
insert_resource: DB_CONNECTIONS
res_type: R      /* renewable - units return when the job completes */
amount: 3
description: "Max concurrent Oracle DW connections"

/* Jobs that consume this resource — each consumes 1 unit */
insert_job: JOB_LOAD_PAYMENTS   job_type: CMD
command:    /opt/etl/load_payments.sh
machine:    etl-proc-01
resources:  (DB_CONNECTIONS, QUANTITY=1)   /* consumes 1 unit */

insert_job: JOB_LOAD_TRADES   job_type: CMD
command:    /opt/etl/load_trades.sh
machine:    etl-proc-02
resources:  (DB_CONNECTIONS, QUANTITY=1)   /* waits if all 3 units are in use */

/* A heavy job consuming multiple units */
insert_job: JOB_BULK_LOAD   job_type: CMD
command:    /opt/etl/bulk_load.sh
machine:    etl-proc-01
resources:  (DB_CONNECTIONS, QUANTITY=2)   /* heavy job, counts as 2 connections */
Resources are the correct way to prevent database overload — not by adding artificial conditions or sleep commands. When a job needs a resource unit that's fully consumed, it stays in ACTIVATED state waiting. Use autorep -r DB_CONNECTIONS -s to check current resource utilisation.

FTP Job Type

The FTP job type (job_type: FTP) transfers files between servers natively without a wrapper script. AutoSys manages the FTP connection, handles authentication, and reports success or failure based on the transfer result. It replaces fragile shell scripts that call ftp or sftp manually.

ftp_job.jil
JIL
/* FTP job — download file from remote server */
insert_job: JOB_FTP_GET_PAYMENT   job_type: FTP
box_name:       BOX_PAYMENT_EOD
ftp_machine:    sftp.partner-bank.com
ftp_user:       svc_transfer
ftp_password:   %%ENCRYPTED_PASSWORD%%
ftp_src_file:   /outgoing/payment_$$GVAR_RUN_DATE.csv
ftp_dest_file:  /data/incoming/payment_$$GVAR_RUN_DATE.csv
ftp_dest_dir:   /data/incoming
machine:        file-drop-01   /* agent that performs the transfer */
description:    "Download daily payment file from partner bank"
max_run_alarm:  15

/* FTP job — upload results to remote server */
insert_job: JOB_FTP_PUT_REPORT   job_type: FTP
box_name:       BOX_PAYMENT_EOD
ftp_machine:    reporting.internal
ftp_user:       svc_reports
ftp_src_file:   /data/reports/eod_$$GVAR_RUN_DATE.csv
ftp_dest_dir:   /reports/incoming
machine:        file-drop-01
condition:      success(JOB_GENERATE_REPORT)
💡
Never hardcode FTP passwords in plain text JIL. Use AutoSys credential management or encrypted password references. Plain text passwords in JIL files are a security audit finding in every enterprise. The %%ENCRYPTED_PASSWORD%% pattern uses the AutoSys credential vault.

Calendars — Advanced Scheduling

Calendars in AutoSys define custom sets of dates for job scheduling — business days, trading days, month-end dates, fiscal periods. Instead of hardcoding days_of_week and run_window, calendars let you maintain a central date authority that all jobs reference.

Defining Calendars in JIL

calendars.jil
JIL
/* Standard calendar — specific dates the job SHOULD run */
insert_calendar: CAL_TRADING_DAYS_2026
datetimes:  01/02/2026 01/05/2026 01/06/2026 01/07/2026
            01/08/2026 01/09/2026  /* add all trading days */
description: "NYSE trading days 2026"

/* Extended calendar — calculates dates by rule
   last_business_day: runs on last business day of each month */
insert_calendar: CAL_MONTH_END
type:        extended
definition:  "last_business_day"
description: "Last business day of each month"

/* Exception calendar — dates the job should NOT run
   Use with run_calendar to exclude holidays */
insert_calendar: CAL_HOLIDAYS_2026
datetimes:  01/01/2026 05/25/2026 07/04/2026 12/25/2026

/* Apply calendar to a box job */
insert_job: BOX_TRADING_EOD   job_type: BOX
run_calendar:  CAL_TRADING_DAYS_2026
exclude_calendar: CAL_HOLIDAYS_2026
start_times:  "18:00"

Checking and Diagnosing Calendars

calendar_commands.sh
BASH
# List all dates in a calendar — critical for month-end debugging
autorep -c CAL_TRADING_DAYS_2026 -t

# List all defined calendars
autorep -c %

# Check next scheduled run dates for a box
autorep -J BOX_TRADING_EOD -d | grep -i calendar

# Forecast when a job will next run (shows upcoming scheduled dates)
forecast -J BOX_TRADING_EOD -t 30
Production Incident — Missing Calendar Dates

Symptom: Month-end reporting jobs didn't run in March. No failures — jobs show INACTIVE the entire day.

Root cause: The calendar CAL_BUSINESS_DAYS_2026 was built from a template that didn't account for March 31 being a Tuesday (valid business day). A data entry error left it off the datetimes list.

Fix: autorep -c CAL_BUSINESS_DAYS_2026 -t confirmed March 31 was missing. Added it with update_calendar JIL, then force-started the box with sendevent -E FORCE_STARTJOB -J BOX_EOM_REPORTS.

Prevention: After building any annual calendar, run autorep -c CALENDAR_NAME -t and manually verify the count of dates matches expected business days for the year.
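That verification step can be scripted as a deployment gate. A minimal sketch, assuming the autorep -c output format shown above (one date per line) and an expected count of roughly 252 trading days; the calendar name, grep pattern, and threshold are illustrative and should be adjusted to your instance:

verify_calendar.sh
BASH
```shell
# Count the 2026 dates in the calendar and compare to the expected
# number of trading days. Adjust the grep pattern to your autorep output.
ACTUAL=$(autorep -c CAL_TRADING_DAYS_2026 -t | grep -c '/2026')
EXPECTED=252
if [ "$ACTUAL" -ne "$EXPECTED" ]; then
    echo "WARNING: CAL_TRADING_DAYS_2026 has $ACTUAL dates, expected $EXPECTED" >&2
    exit 1
fi
```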

job_depends — Validating Condition References

One of the most common silent failures in AutoSys: a job has a condition referencing a job name that doesn't exist. The condition silently evaluates to false — the job never runs, no error fires, and the pipeline stalls indefinitely. job_depends is the command that catches this.

job_depends_commands.sh
BASH
# Check if all condition references in a job are valid
# Reports any job names referenced in conditions that don't exist in AutoSys
job_depends -J JOB_LOAD_TXN

# Check an entire box and all its children
job_depends -J BOX_PAYMENT_EOD

# Check all jobs in the system — run this after any large JIL import
job_depends -J %

# Example output when a dependency is broken:
# JOB_LOAD_TXN: condition job JOB_TRANSFORM_V2_TXN not found in database
# This means JOB_LOAD_TXN will never start — its condition always false
Run job_depends -J % after every JIL deployment. When jobs are renamed, deleted, or migrated between environments, condition references can become dangling pointers. job_depends catches all of them in one pass. A job with a broken condition will silently never run — there is no error, no alarm, just an ACTIVATED job that waits forever.
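The runbook advice above can be turned into a hard gate in the deploy script. A sketch that assumes the "not found" message format shown in the example output; the exact wording may differ by release, so verify against your instance before relying on the grep:

check_conditions.sh
BASH
```shell
# Fail the JIL deployment if any condition references a job that doesn't exist
if job_depends -J % | grep -i 'not found'; then
    echo "ERROR: dangling condition references detected, fix before deploying" >&2
    exit 1
fi
echo "All condition references resolve"
```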
Production Incident — Renamed Job Broke 12 Conditions

Symptom: After a JIL refactor renaming JOB_EXTRACT to JOB_EXTRACT_TXN, 12 downstream jobs stopped running. They showed ACTIVATED indefinitely.

Root cause: The 12 jobs had condition: success(JOB_EXTRACT). The old name no longer existed. Conditions silently evaluated to false.

Fix: job_depends -J % immediately identified all 12 broken references. Updated all conditions to reference JOB_EXTRACT_TXN.

Prevention: Make job_depends -J % part of your deployment runbook. Run it after every JIL change that renames or deletes jobs.

Key Takeaways
01. AutoSys is event-driven, not time-driven. Jobs run when conditions are met, not just at scheduled times.
02. ON_HOLD pauses a job but blocks downstream. ON_ICE skips a job and lets downstream proceed. Know which you need before acting.
03. CHANGE_STATUS is your most important recovery tool. Use it when work was done manually and you need the pipeline to continue without re-running.
04. Always set max_run_alarm on long-running jobs. Without it, a hung job runs silently forever and the box never completes.
05. STARTING stuck for more than 5 minutes usually means the remote agent is down. Check the agent before investigating the job.
06. Global variables persist across box runs. Always reset critical variables at pipeline start to avoid processing stale data from previous runs.
07. Use EEM roles (Operator, Developer, Admin) mapped to LDAP groups. On-call engineers need Operator access only — never hand out Admin for day-to-day support.
08. In Dual Event Server HA setups, use a virtual hostname for agent connections — this makes failover transparent to all remote agents.
09. The Web Services job type (WS) calls REST APIs natively — no wrapper script needed. Container workloads run via CMD jobs invoking docker or kubectl on an agent machine.
10. autorep -J JOB_NAME -d gives you machine, exit code, and event timestamps — start with -d when diagnosing a failure, not just the default status summary.

Interview Questions

These are the questions that separate AutoSys operators from AutoSys engineers. All are based on real production scenarios.

Q What is the difference between ON_HOLD and ON_ICE in AutoSys?
ON_HOLD prevents a job from running but it remains visible to the scheduler — downstream jobs still wait for it to complete before proceeding. ON_ICE removes the job from scheduler consideration entirely — downstream jobs treat it as if it already succeeded and proceed without waiting. Use ON_HOLD when you want to delay execution and resume later. Use ON_ICE when you want to permanently skip a job for this cycle without blocking the pipeline.
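In practice both states are set and cleared with sendevent. A sketch using a hypothetical job name:

hold_vs_ice.sh
BASH
```shell
# ON_HOLD: pause the job; downstream jobs wait for it
sendevent -E JOB_ON_HOLD  -J JOB_SETTLE_WIRE
sendevent -E JOB_OFF_HOLD -J JOB_SETTLE_WIRE   # resume normal scheduling

# ON_ICE: skip the job this cycle; downstream proceeds as if it succeeded
sendevent -E JOB_ON_ICE   -J JOB_SETTLE_WIRE
sendevent -E JOB_OFF_ICE  -J JOB_SETTLE_WIRE   # put it back in the pipeline
```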
Q A box job is stuck in RUNNING but all child jobs show SUCCESS. What happened and how do you fix it?
The most common cause is that the box itself has a condition attribute referencing a job outside the box that hasn't completed. Run autorep -J BOX_NAME -d to inspect the box's own condition. Also check if any child job is in a state other than SUCCESS — autorep sometimes truncates output. If the box condition references an external job, check that job's status and either fix it or force its status to SUCCESS. If all conditions are genuinely met, try sendevent -E FORCE_STARTJOB -J BOX_NAME to re-evaluate.
Q When would you use CHANGE_STATUS instead of FORCE_STARTJOB?
Use CHANGE_STATUS when the work represented by the job has already been completed outside of AutoSys — for example, a DBA manually ran the SQL that the job would have executed, or a file transfer was done manually. In this case you don't want the job to re-run (which could cause duplicates or conflicts), you just want AutoSys to know it's done so downstream jobs can proceed. FORCE_STARTJOB actually executes the job again, which you only want when the job genuinely needs to re-run.
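The two commands side by side, with a hypothetical job name:

recover_vs_rerun.sh
BASH
```shell
# The DBA already ran the SQL by hand: mark the job done, do NOT execute it
sendevent -E CHANGE_STATUS -s SUCCESS -J JOB_LOAD_REFDATA

# The job genuinely needs to execute again
sendevent -E FORCE_STARTJOB -J JOB_LOAD_REFDATA
```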
Q How do you restart an entire failed pipeline from a specific step without re-running steps that already succeeded?
You cannot directly restart from a specific step, but you can achieve the same result. First, use CHANGE_STATUS to mark all jobs that already succeeded as SUCCESS (if they're not already). Then FORCE_STARTJOB on the specific job that failed. AutoSys will re-evaluate conditions and since the earlier jobs show SUCCESS, the failed job's conditions are met and it will run. Any jobs downstream with condition: success(FAILED_JOB) will automatically queue once the re-run succeeds.
Q What does n_retrys do and what are its limitations?
n_retrys tells AutoSys to automatically restart a failed job up to N times before marking it as permanently FAILURE. Combined with retry_interval (minutes between retries), it handles transient failures like network timeouts. The key limitation: retries only apply to FAILURE exit codes, not to TERMINATED status (jobs killed via KILLJOB). Also, if the job fails during a box run and the box completes (because other jobs don't depend on this one), retries stop — the box ending resets all job states.
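A minimal JIL sketch showing n_retrys on a CMD job; the job name, script path, and machine alias are illustrative:

retry_job.jil
JIL
```jil
insert_job: JOB_FETCH_RATES   job_type: CMD
command:  /opt/batch/fetch_rates.sh
machine:  batch01
n_retrys: 3          /* up to 3 automatic restarts on FAILURE */
alarm_if_fail: 1     /* alarm once the job finally ends in FAILURE */
```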
Q What is the difference between FORCE_STARTJOB and STARTJOB?
STARTJOB starts a job only if its conditions are met — it respects the job's condition attribute and will not start if conditions are unsatisfied. FORCE_STARTJOB bypasses all conditions, time windows, and date restrictions and starts the job immediately regardless of its state. Use STARTJOB to trigger a job within its normal logic. Use FORCE_STARTJOB for emergency recovery when you need to override everything.
Q A job is stuck in STARTING for 15 minutes. Walk through your exact diagnosis steps.
Step 1: Run autoping -m MACHINE_NAME — if it fails, the agent is unreachable. Fix the agent first (restart cybagent service). Step 2: If autoping succeeds, run autorep -M MACHINE_NAME -r to confirm the machine alias is registered correctly in AutoSys topology. Step 3: SSH to the target machine and verify the owner account exists: id svc_autosys. If the user doesn't exist, the job stays in STARTING indefinitely with no error. Step 4: Check the owner has execute permission on the script path. Step 5: Check EAS logs for dispatch errors.
Q What is a lookback condition and when would you use one?
A lookback condition restricts a dependency to a success within a specific time window — for example, success(JOB_MARKET_FEED, 00.30) means the job only satisfies the condition if it succeeded within the last 30 minutes. Without a lookback, a job that succeeded hours or even days ago still satisfies the condition. Lookbacks are essential for time-sensitive pipelines — market data feeds, regulatory cutoffs, real-time settlement — where stale data from a previous run completing the condition would be dangerous.
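A JIL sketch of the lookback from the example; the dependent job's name, script, and machine are illustrative:

lookback.jil
JIL
```jil
/* Only satisfied if JOB_MARKET_FEED succeeded in the last 30 minutes */
insert_job: JOB_PRICE_SNAPSHOT   job_type: CMD
command:   /opt/batch/price_snapshot.sh
machine:   batch01
condition: success(JOB_MARKET_FEED, 00.30)
```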
Q How do global variables work in AutoSys and what is their biggest production risk?
Global variables are key-value pairs stored in the AutoSys database, accessible to any job on any machine using the $$VARNAME syntax in commands or value(VARNAME) in conditions. They're set via sendevent -E SET_GLOBAL or defined in JIL with insert_global. The biggest production risk is persistence — global variables retain their value across box runs. If last night's run set GVAR_PAYMENT_FILE to a specific path and tonight's file watcher fails silently, the variable still holds yesterday's path and downstream jobs will process stale data. Always reset critical variables at the start of each pipeline run.
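A sketch of the reset pattern using the -G "name=value" form of SET_GLOBAL; the variable name and path are illustrative, and the exact sendevent syntax should be confirmed on your release:

reset_globals.sh
BASH
```shell
# First job of the pipeline: blank the variable so a silent file-watcher
# failure cannot leave downstream jobs reading yesterday's path
sendevent -E SET_GLOBAL -G "GVAR_PAYMENT_FILE="

# Later, the job that lands the file records tonight's path
sendevent -E SET_GLOBAL -G "GVAR_PAYMENT_FILE=/data/in/payments_20260417.dat"
```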
Q What does the machine attribute in JIL reference — the hostname, IP, or something else?
The machine attribute references the alias registered in the AutoSys topology database, not the server's actual hostname or IP address. These can be identical, but they don't have to be. You can verify registered machine aliases with autorep -M % -r. Using the wrong value — the actual hostname when the topology alias is different — leaves the job stuck in STARTING indefinitely because the Event Server cannot resolve the target agent. Always confirm the topology alias before defining a new job for a new machine.
Q What is the difference between alarm_if_fail and alarm_if_terminated?
alarm_if_fail triggers an alarm when a job exits with a non-zero exit code (FAILURE status). alarm_if_terminated triggers an alarm when a job is killed, whether via KILLJOB or by term_run_time expiry (TERMINATED status). Note that max_run_alarm only raises an alarm at the threshold; term_run_time is the attribute that actually kills the job. In production, set both alarms on critical jobs: a job that hangs and gets killed will not trigger alarm_if_fail because its status is TERMINATED, not FAILURE. Without alarm_if_terminated, a silently hung-and-killed job can go unnoticed.
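A JIL sketch combining both alarms on a critical job. On the releases I have worked with, max_run_alarm raises the alarm and term_run_time performs the kill; names and thresholds are illustrative:

alarm_pair.jil
JIL
```jil
insert_job: JOB_SETTLE_BATCH   job_type: CMD
command: /opt/batch/settle.sh
machine: batch01
max_run_alarm: 45         /* raise MAX_RUN_ALARM after 45 minutes */
term_run_time: 60         /* kill the job at 60 minutes -> TERMINATED */
alarm_if_fail: 1          /* alarm on non-zero exit (FAILURE) */
alarm_if_terminated: 1    /* alarm on KILLJOB or term_run_time kill */
```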
Q How does AutoSys High Availability work, and what is split-brain?
AutoSys HA uses a Primary/Shadow Event Server configuration sharing the same database. The Shadow monitors the Primary's heartbeat — if the heartbeat stops, the Shadow promotes itself to Primary and resumes scheduling within ~60 seconds. Split-brain occurs when a network partition causes the Shadow to lose the heartbeat temporarily even though the Primary is still alive. Both servers promote themselves to Primary simultaneously and dispatch the same jobs, causing duplicate runs. Prevention: dedicated heartbeat network interface, a longer promotion timeout on the Shadow, and a database-level fencing lock to prevent dual-active.
Q What is the permission attribute in JIL and what are the risks of using wx or we?
The permission attribute controls who can execute (x) and edit (e) a job. The prefixes are g (group), w (world), and m (me/owner). wx means any user in the system can force-start the job; we means any user can modify the JIL definition. In enterprise environments, wx and we are serious security risks — any developer or operator account could accidentally or maliciously modify or trigger a production settlement job. Best practice is gx,ge only, mapping the group to a controlled AD/LDAP group via EEM.
Q How would you use AutoSys to call a REST API as part of a pipeline?
Use the Web Services job type (job_type: WS), which is native in AutoSys r12.x+. Define web_svc_url, web_svc_method (GET/POST), web_svc_body for the request payload, and web_svc_success_codes for the HTTP codes that constitute success (typically 200,201,202). The job succeeds or fails based on the response code — no wrapper script needed. Set max_run_alarm to handle hung connections. Global variables can be interpolated into the URL or body using $$GVAR_NAME syntax.
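A WS job sketch using the attributes named above; the URL, body, and job name are illustrative, and the attribute spellings should be checked against your r12.x JIL reference:

ws_job.jil
JIL
```jil
insert_job: JOB_NOTIFY_SETTLEMENT   job_type: WS
web_svc_url:    https://api.example.internal/v1/settlement/complete
web_svc_method: POST
web_svc_body:   "{\"runDate\": \"$$GVAR_RUN_DATE\"}"
web_svc_success_codes: "200,201,202"
max_run_alarm: 10    /* don't let a hung connection stall the box */
```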
Q A calendar-driven box didn't run on the last business day of the month. What are the likely causes?
The most common cause is a calendar definition that doesn't account for month-end on a weekend. If run_calendar: BUSINESS_DAYS is used and the last day of the month falls on Saturday, the box has no trigger date for that run. Other causes: the calendar was updated and the change wasn't applied to all environments; the box has a days_of_week restriction that conflicts with the calendar; or the Event Server was down during the scheduled window and the missed run wasn't caught up. Diagnose with autorep -c CALENDAR_NAME -t to see all scheduled dates.
Q What is the difference between std_out_file and the job log in AutoSys?
std_out_file captures the script's stdout on the remote agent machine — whatever the shell script prints to standard output. The job log in the AutoSys database captures the job's lifecycle events: when it was dispatched, which agent received it, the exit code, start and end timestamps. Both are essential for debugging: the job log tells you what AutoSys did, the std_out_file tells you what the script did. If std_out_file is not set, stdout is lost when the job completes. Use $DATE in the filename to preserve one log per run rather than overwriting.
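A JIL sketch wiring both log files. The per-run $DATE suffix follows the tip above (token support varies by release, so verify on your instance); paths and names are illustrative:

log_capture.jil
JIL
```jil
insert_job: JOB_LOAD_TXN_LOGS   job_type: CMD
command:      /opt/batch/load_txn.sh
machine:      batch01
std_out_file: /var/log/autosys/load_txn.$DATE.out   /* one log per run */
std_err_file: /var/log/autosys/load_txn.$DATE.err
```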
Q How do you export an existing job definition back to JIL format?
Use autorep -J JOB_NAME -q — the -q flag outputs the job definition in JIL format that can be piped directly to the jil command on another instance. This is the standard way to copy jobs between environments (dev → test → prod), create backups before making changes, or document existing job definitions. For a full box including all children: autorep -J BOX_NAME -q exports the box and every job inside it.
Q What happens to running jobs when the Primary Event Server fails over to the Shadow?
Jobs already dispatched and running on remote agents continue executing — agents run independently and don't require a live Event Server connection to complete a job. The agent writes the exit code and status back to the database when the job completes, and the new Primary Event Server picks up those results on its next database poll. Jobs that were in STARTING or ACTIVATED state at the moment of failover may need to be force-started manually — the transition can cause them to be skipped if the new Primary doesn't see their dispatch record.
Q How do you check why a global variable condition is preventing a job from starting?
Run autorep -G GVAR_NAME to see the current value of the variable. If it's empty or set to an unexpected value, that's why the condition fails. Also run autorep -J JOB_NAME -d to inspect the job's condition attribute and confirm exactly what value the condition is checking. Common scenario: a job has condition: value(GVAR_RUN_DATE) != "" but the variable was never set because an earlier job that calls SET_GLOBAL failed. Fix by manually setting the variable: sendevent -E SET_GLOBAL -G "GVAR_RUN_DATE=20260417", then force-starting the blocked job.
Q What is the purpose of min_file_size in a File Watcher job and what problem does it solve?
min_file_size (typically spelled watch_file_min_size in JIL) sets the minimum size in bytes that the watched file must reach before the FW job considers it a success. Setting it to 1 prevents the job from triggering on an empty file, a common failure mode in file-based pipelines where the upstream system creates the file immediately but writes data to it over time. Without a minimum size, the FW job succeeds the instant the file appears (even with 0 bytes), the downstream CMD job starts, and it attempts to process an empty file. This causes subtle failures that are hard to diagnose because the file exists but contains no data.
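A file-watcher sketch. On the releases I have seen, the JIL spelling is watch_file_min_size; the watch interval, path, machine, and job name are illustrative:

fw_min_size.jil
JIL
```jil
insert_job: FW_PAYMENT_FILE   job_type: FW
machine:             batch01
watch_file:          /data/in/payments.dat
watch_interval:      60     /* poll every 60 seconds */
watch_file_min_size: 1      /* never succeed on a 0-byte file */
```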
Q How do you force-run a box for a specific historical date that was missed?
Use the -D flag with sendevent: sendevent -E FORCE_STARTJOB -J BOX_NAME -D YYYYMMDD. This triggers the box as if it were running on the specified date, which is critical for date-aware jobs that use $$DATE or date-based global variables. Without the -D flag, FORCE_STARTJOB runs the box with today's date, which would cause the jobs to process the wrong data set. Always confirm the date format matches your AutoSys configuration — some environments use MMDDYYYY.
Q What is the EEM role model in AutoSys and how does it differ from JIL permissions?
JIL permissions (permission: gx,ge) are per-job access controls defined in the job definition itself — they control who can execute or edit that specific job based on OS group membership. EEM (Embedded Entitlements Manager) is the centralized RBAC layer that maps roles (Operator, Developer, Admin) to LDAP/AD groups across all jobs in the instance. EEM supersedes JIL permissions in modern AutoSys deployments — it allows consistent access control without touching individual job definitions. Use EEM when you need enterprise-wide role enforcement; use JIL permissions as a secondary layer for job-specific restrictions.