Everything you need to master CA AutoSys Workload Automation — from JIL syntax to production incident recovery. Written for engineers who actually run these systems at scale.
AutoSys (now officially CA Workload Automation AE, owned by Broadcom) is an enterprise job scheduling system that runs batch jobs, file watchers, and complex multi-step workflows across distributed systems. It is the backbone of nightly processing in investment banks, insurance companies, telcos, and healthcare systems — the infrastructure that runs payroll, bank settlements, ETL pipelines, and regulatory reporting at 2am every night.
The key difference from cron is that AutoSys jobs are event-driven. A job doesn't just run at a time — it runs when conditions are met: another job succeeded, a file appeared in a directory, a global variable changed, or an external trigger fired. This makes it the coordination layer for complex enterprise pipelines where Job B must wait for Job A, and Job C must wait for both.
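As a minimal sketch (hypothetical job names and paths), that dependency logic looks like this in JIL — Job B waits for Job A, and Job C waits for both:

```
/* Hypothetical example — not from a real pipeline */
insert_job: JOB_A
job_type: CMD
command: /opt/app/step_a.sh
machine: host-01

insert_job: JOB_B
job_type: CMD
command: /opt/app/step_b.sh
machine: host-01
condition: success(JOB_A)

insert_job: JOB_C
job_type: CMD
command: /opt/app/step_c.sh
machine: host-01
condition: success(JOB_A) AND success(JOB_B)
```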
| Feature | Cron | AutoSys | Control-M |
|---|---|---|---|
| Job dependencies | None | Full condition logic | Full condition logic |
| Cross-server jobs | Per-server only | Remote agents | Remote agents |
| File watchers | No | Native FW job type | Native |
| Alerting | Manual setup | Built-in alarms | Built-in |
| Restart/recovery | Manual | sendevent commands | GUI + commands |
| Audit trail | Logs only | Full event history in DB | Full event history |
| Typical use | Simple scheduled tasks | Enterprise batch pipelines | Enterprise batch pipelines |
AutoSys has three main components: the Event Server (EAS), Remote Agents, and the AutoSys database. Understanding how they interact is essential for diagnosing failures and designing reliable pipelines.
The Event Server is the brain. It reads events from the database queue, evaluates job conditions, and dispatches ready jobs to remote agents. It runs as a daemon process (EAS) and must be running for any job scheduling to occur. If the Event Server goes down, jobs queue but do not run.
Agents are lightweight daemons installed on target machines. They receive job dispatch instructions from the Event Server, execute the command in the job's owner context, capture exit codes and stdout/stderr, and report status back. A machine without an agent cannot run AutoSys jobs.
Symptom: Jobs targeting a specific machine are stuck in STARTING state for 10+ minutes, then fail with "connection refused."
Root cause: The Remote Agent daemon on the target machine crashed after an OS patch restart. The Event Server couldn't reach it.
Fix: SSH to the target machine, check agent status with `ps -ef | grep cybAgent`, restart with `service cybagent restart`, then force-start the affected jobs.
Prevention: Add agent health monitoring to your infrastructure alerting. A dead agent is silent — it doesn't alert AutoSys directly.
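A minimal health-probe sketch for that monitoring, assuming the cybAgent process name used elsewhere in this guide. It reads `ps -ef` output on stdin so it can be wired into any alerting wrapper:

```shell
#!/bin/sh
# Exit 0 if the agent daemon appears in `ps -ef` output piped in on stdin,
# non-zero otherwise. The [c] bracket trick stops the pattern from matching
# the probe's own command line.
agent_running() {
  awk '/[c]ybAgent/ { found = 1 } END { exit found ? 0 : 1 }'
}

# Typical wiring on a real agent host (uncomment to use):
# ps -ef | agent_running || echo "ALERT: cybAgent down on $(hostname)"
```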
JIL (Job Information Language) is the DSL used to define every object in AutoSys — jobs, boxes, file watchers, global variables, and machine definitions. Every AutoSys engineer needs to read and write JIL fluently.
```
/* BOX — container for the entire EOD pipeline */
insert_job: BOX_PAYMENT_EOD
job_type: BOX
owner: svc_autosys
permission: gx,ge,wx,we,mx,me
date_conditions: 1
days_of_week: mo,tu,we,th,fr
start_times: "22:00"
timezone: GMT
description: "End-of-day payment settlement pipeline"
alarm_if_fail: 1
alarm_if_terminated: 1
```
```
/* Step 1: Extract — no condition, runs when box starts */
insert_job: JOB_EXTRACT_TXN
job_type: CMD
box_name: BOX_PAYMENT_EOD
command: /opt/etl/extract_transactions.sh
machine: db-extract-01
owner: svc_autosys
std_out_file: /logs/autosys/extract_txn.out
std_err_file: /logs/autosys/extract_txn.err
max_run_alarm: 60

/* Step 2: Transform — depends on extract success */
insert_job: JOB_TRANSFORM_TXN
job_type: CMD
box_name: BOX_PAYMENT_EOD
command: /opt/etl/transform.sh
machine: etl-proc-01
owner: svc_autosys
condition: success(JOB_EXTRACT_TXN)
max_run_alarm: 45

/* Step 3: Load — depends on transform, alerts on failure */
insert_job: JOB_LOAD_TXN
job_type: CMD
box_name: BOX_PAYMENT_EOD
command: /opt/etl/load_to_dw.sh
machine: dw-loader-01
owner: svc_autosys
condition: success(JOB_TRANSFORM_TXN)
alarm_if_fail: 1
max_run_alarm: 30
n_retrys: 2
retry_interval: 5
```
```
/* File watcher — triggers when file arrives */
insert_job: FW_PAYMENT_FILE
job_type: FW
box_name: BOX_PAYMENT_EOD
watch_file: /data/incoming/payment_*.csv
watch_interval: 60
owner: svc_autosys
machine: file-drop-01
min_file_size: 1

/* Process job runs only after file arrives */
insert_job: JOB_PROCESS_PAYMENT
job_type: CMD
box_name: BOX_PAYMENT_EOD
command: /opt/payment/process.sh
machine: proc-01
condition: success(FW_PAYMENT_FILE)
```
Every AutoSys job cycles through a series of status codes. Knowing what each one means — and what caused it — is the core skill for on-call AutoSys support.
| Status | Meaning | Common Cause |
|---|---|---|
| INACTIVE | Job exists but has never run in this cycle | Normal initial state |
| ACTIVATED | Box started, job is waiting for its conditions | Normal — waiting for dependencies |
| STARTING | Event Server dispatching to remote agent | Normal, should be brief (<30s) |
| RUNNING | Executing on remote agent | Normal during execution |
| SUCCESS | Completed with exit code 0 | Normal completion |
| FAILURE | Completed with non-zero exit code | Script error, bad data, permissions |
| TERMINATED | Killed via sendevent KILLJOB or a run-time limit | Manual intervention or term_run_time expiry |
| ON_HOLD | Held by operator — will not run | Manual hold, maintenance window |
| ON_ICE | Frozen — invisible to scheduler | Manual skip, downstream proceeds |
| RESTART | Waiting to be restarted after failure | n_retrys configured, retry pending |
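When scripting against these statuses, one approach (a sketch, not an official tool) is to pull the status column for a named job out of `autorep -J job -s` output. This assumes the job name is the first field and the status code (often abbreviated, e.g. SU, FA) is the last field of its row — column layout varies across versions, so verify on your instance:

```shell
#!/bin/sh
# Print the last field of the row whose first field matches the job name.
# Assumes autorep-style columns; adjust if your version differs.
job_status() {
  awk -v j="$1" '$1 == j { print $NF }'
}

# Typical wiring where autorep is installed (uncomment to use):
# autorep -J JOB_LOAD_TXN -s | job_status JOB_LOAD_TXN
```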
When something breaks in production, sendevent is your primary intervention tool. Every AutoSys engineer needs these commands available at 3am without Googling.
Scenario: JOB_LOAD_TXN failed at 23:45 due to a tablespace issue. The DBA fixed the tablespace and manually inserted the data directly. The EOD box is stuck waiting for the load job to succeed.
Wrong fix: `sendevent -E FORCE_STARTJOB -J JOB_LOAD_TXN` — this re-runs the load script, which will attempt to insert the data again and create duplicates.
Correct fix: `sendevent -E CHANGE_STATUS -J JOB_LOAD_TXN -s SUCCESS` — marks the job as succeeded without running it, allowing downstream jobs to proceed. No duplicate data.
autorep is the read-only companion to sendevent. It queries the AutoSys database and reports on job status, history, and definitions. Master these commands for rapid diagnosis.
```shell
# Show current status of a specific job
autorep -J JOB_LOAD_TXN -s

# Show all jobs in a box with their current status
autorep -J BOX_PAYMENT_EOD -s

# Show job definition (JIL attributes)
autorep -J JOB_LOAD_TXN -d

# Show run history for a job (last 10 runs)
autorep -J JOB_LOAD_TXN -s -t

# Show all FAILURE jobs in the last 24 hours
autorep -J % -s | grep FAILURE

# Show all jobs on a specific machine
autorep -M db-extract-01 -s

# Show all currently RUNNING jobs
autorep -J % -s | grep RUNNING

# Show jobs that ran between specific times
autorep -J BOX_PAYMENT_EOD -s -t -S 2200 -E 2359

# Export job definition back to JIL format
autorep -J BOX_PAYMENT_EOD -q
```
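The grep-pipeline pattern above can be wrapped into a small helper — a sketch that counts FAILURE rows in piped autorep output (it assumes the status string appears verbatim in the row; adjust to your instance's column format):

```shell
#!/bin/sh
# Count lines containing FAILURE. Always exits 0 and prints a number,
# unlike `grep -c`, which exits non-zero when the count is 0.
count_failures() {
  awk '/FAILURE/ { n++ } END { print n + 0 }'
}

# Typical wiring where autorep is installed (uncomment to use):
# autorep -J % -s | count_failures
```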
The `%` wildcard matches all jobs. Pipe to `grep` to filter by status, and add `| wc -l` to get a count.

What follows are systematic recovery paths for the most common AutoSys production failures. Each pattern includes the symptom, diagnosis steps, and the correct fix.
Symptom: A box is RUNNING but a child job never starts. The box has a condition dependency on a job outside the box that hasn't completed, or a child job was added after the box started and never ran. Check with `autorep -J BOX_NAME -d` to see the box's own conditions, then check all child job statuses.
Symptom: Jobs stuck in STARTING. The remote agent on the target machine is down, unreachable, or the port is blocked. SSH to the target machine and check: `ps -ef | grep cybAgent`. If the process is missing, restart the agent. If the process is running, check firewall rules between the Event Server and the agent machine.
Symptom: Events queue but nothing runs. The Event Server may be down or the evaluation cycle is delayed. Check the EAS process: `ps -ef | grep EAS`. Also check the AutoSys database connection — if the EAS can't reach the DB, it queues internally but doesn't dispatch. Look in /opt/CA/WorkloadAutomationAE/SystemAgent/logs/ for EAS error logs.
Symptom: End-of-month jobs didn't run. No failures, just no execution. Jobs show INACTIVE.
Root cause: The box had run_calendar: BUSINESS_DAYS and the last day of the month fell on a Saturday. The calendar had no entry for that date, so the box never activated.
Fix: Updated the calendar definition to include the last business day of each month using extended_calendar attributes. Manually force-started the box for the missed date using `sendevent -E FORCE_STARTJOB -J BOX_EOM_REPORTS -D 20260330`.
Lesson: Always test calendar logic with `autorep -c CALENDAR_NAME -t` before deploying month-end or quarter-end jobs.
Global variables are key-value pairs stored in the AutoSys database that any job on any machine can read or write at runtime. They are the primary mechanism for passing data between jobs in a pipeline — file names, run dates, record counts, flags — without hardcoding values into scripts.
```
/* Define a global variable in JIL */
insert_global: GVAR_RUN_DATE
value: "20260417"

insert_global: GVAR_PAYMENT_FILE
value: ""

/* Reference a global variable in a job command */
insert_job: JOB_EXTRACT_TXN
job_type: CMD
command: /opt/etl/extract.sh $$GVAR_RUN_DATE
machine: db-extract-01

/* Use global variable in a condition */
insert_job: JOB_LOAD_TXN
job_type: CMD
command: /opt/etl/load.sh $$GVAR_PAYMENT_FILE
condition: success(JOB_EXTRACT_TXN) AND value(GVAR_PAYMENT_FILE) != ""
```
```shell
# Set a global variable from within a running job script
sendevent -E SET_GLOBAL -G GVAR_PAYMENT_FILE -V "/data/incoming/payment_20260417.csv"

# Set from command line (operator intervention)
sendevent -E SET_GLOBAL -G GVAR_RUN_DATE -V "20260417"

# Read current value
autorep -G GVAR_RUN_DATE

# List all global variables
autorep -G %
```
Conditions can test global variables with the `value(GVAR_NAME)` syntax. This lets you gate job execution on data state, not just job completion state — a powerful pattern for file-driven pipelines where you need to verify a file path was actually set before attempting to process it.

Problem: Global variables persist across box runs. If last night's run set GVAR_PAYMENT_FILE to a stale path and tonight's file watcher failed silently, the load job may process yesterday's file again.
Fix: Always reset critical global variables to empty at the start of each box run using a dedicated reset job as the first step, before any file watchers or extract jobs run.
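A sketch of such a reset job (hypothetical job name, using the SET_GLOBAL event shown above; this assumes your version accepts an empty -V value — some shops set a sentinel like NONE instead):

```
/* First job in the box — clears stale values before anything else runs */
insert_job: JOB_RESET_GVARS
job_type: CMD
box_name: BOX_PAYMENT_EOD
machine: proc-01
owner: svc_autosys
command: sendevent -E SET_GLOBAL -G GVAR_PAYMENT_FILE -V ""

/* File watcher and extract jobs then gate on the reset: */
/*   condition: success(JOB_RESET_GVARS) */
```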
AutoSys security controls who can view, modify, run, and force-start jobs. In enterprise environments this is critical — you do not want a developer accidentally force-starting a production EOD pipeline. The security model has two layers: JIL permissions (legacy, per-job) and CA Embedded Entitlements Manager, or EEM (modern, role-based).
Every job definition includes a permission attribute that controls access at the job level. Permissions are set as a comma-separated list of access codes.
```
insert_job: JOB_PAYMENT_EOD
job_type: CMD
owner: svc_autosys
permission: gx,ge,wx,we,mx,me

/* Permission codes:
   g = group, w = world, m = me (owner)
   x = execute (run/force-start)
   e = edit (modify JIL definition)
   r = read (view job definition)

   gx,ge = group can execute and edit
   wx,we = world can execute and edit
   mx,me = owner can execute and edit

   Production best practice: gx,ge only
   Never wx or we in production — too permissive */
```
EEM is the modern RBAC layer. It allows administrators to define roles (operator, developer, admin, read-only) and map them to LDAP/AD groups. This means access is managed centrally via your directory service rather than per-job JIL attributes.
| Role | Typical Permissions | Use Case |
|---|---|---|
| Read-Only | View job status and history only | Business analysts, audit |
| Operator | Force-start, hold, ice, change status | On-call support engineers |
| Developer | Insert/update job definitions in dev/test | Application developers |
| Admin | Full access including delete and security changes | AutoSys team only |
A single Event Server is a single point of failure. If it goes down, all job scheduling stops. For 24/7 enterprise environments, AutoSys supports a Dual Event Server (Primary/Shadow) configuration that provides automatic failover.
Two Event Server instances run simultaneously — a Primary that actively schedules and dispatches, and a Shadow that stays in sync by reading the same database. If the Primary fails, the Shadow detects the absence of heartbeat and promotes itself to Primary automatically, typically within 30-60 seconds.
Check Event Server reachability with `autoping -m EAS_MACHINE_NAME` and include it in your monitoring stack.

Symptom: Jobs ran twice. Duplicate records in the data warehouse.
Root cause: Network partition between Primary and Shadow. Shadow couldn't see Primary's heartbeat, promoted itself. Network recovered — now both servers thought they were Primary and dispatched the same jobs.
Fix: Immediately shut down one EAS instance. Identify duplicated records via run history timestamps. Implement a fencing mechanism (STONITH or database-based lock) to prevent dual-active scenarios.
Prevention: Use a dedicated heartbeat network interface separate from the data network. Configure the Shadow with a longer promotion timeout to survive brief network blips.
AutoSys r12.x and later supports integrations well beyond shell scripts on bare-metal Linux. In 2026 enterprise environments, AutoSys pipelines routinely trigger REST APIs, run containerised workloads, and dispatch jobs to cloud agents.
The Web Services job type (job_type: WS) allows AutoSys to call REST or SOAP endpoints directly as a job — no wrapper script needed. The job succeeds or fails based on the HTTP response code.
```
/* Web Services job — calls a REST endpoint */
insert_job: JOB_TRIGGER_RISK_ENGINE
job_type: WS
box_name: BOX_PAYMENT_EOD
web_svc_url: https://risk-engine.internal/api/v2/run-eod
web_svc_method: POST
web_svc_body: {"run_date":"$$GVAR_RUN_DATE","mode":"full"}
web_svc_success_codes: 200,201,202
condition: success(JOB_LOAD_TXN)
max_run_alarm: 15
```
AutoSys supports running Docker containers as jobs via the DOCKER job type or via CMD jobs that invoke docker run or kubectl. The container job type manages the container lifecycle — pull, run, capture exit code, clean up.
```
/* CMD job wrapping a Docker container run */
insert_job: JOB_RISK_CALC_CONTAINER
job_type: CMD
command: docker run --rm \
  -e RUN_DATE=$$GVAR_RUN_DATE \
  -v /data/risk:/data \
  registry.internal/risk-calculator:2.1.4 \
  --mode full-eod
machine: docker-host-01
owner: svc_autosys
max_run_alarm: 45

/* Kubernetes job via kubectl */
insert_job: JOB_RISK_CALC_K8S
job_type: CMD
command: kubectl create job risk-calc-$$GVAR_RUN_DATE \
  --from=cronjob/risk-calculator \
  --namespace=batch-jobs
machine: k8s-bastion-01
max_run_alarm: 60
```
AutoSys cloud agents run on AWS EC2, Azure VMs, or GCP instances and register back to the on-premises Event Server. From AutoSys's perspective they are just another machine attribute — the job definition is identical. Cloud agents enable hybrid pipelines where on-premises extract jobs feed cloud-based transformation and ML workloads.
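Because the job definition is identical, a hybrid pipeline sketch (hypothetical machine aliases) only varies the machine attribute:

```
/* On-prem extract feeds a cloud transform — only `machine` differs */
insert_job: JOB_EXTRACT_ONPREM
job_type: CMD
command: /opt/etl/extract.sh
machine: db-extract-01        /* on-premises agent */

insert_job: JOB_TRANSFORM_CLOUD
job_type: CMD
command: /opt/etl/transform.sh
machine: aws-etl-agent-01     /* cloud agent, registered like any other */
condition: success(JOB_EXTRACT_ONPREM)
```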
Use `kubectl wait --for=condition=complete job/risk-calc-DATE --timeout=3600s` in a subsequent CMD job to gate downstream AutoSys jobs on Kubernetes job completion.

JIL defines your jobs, but the `jil` command is what loads those definitions into the AutoSys database. Every engineer needs to know the three ways to run it and how to safely validate before committing.
```shell
# Method 1: Redirect a JIL script file (most common in production)
jil < payment_eod.jil

# Method 2: Interactive mode — type JIL statements directly, Ctrl+D to commit
jil
# jil> insert_job: JOB_TEST   job_type: CMD
# jil> command: /opt/test.sh
# jil> machine: proc-01
# jil> ^D   (Ctrl+D commits to database)

# Method 3: Validate syntax WITHOUT committing to database
# Always run this before applying in production
jil -syntax < payment_eod.jil
# If valid: no output and exit code 0
# If invalid: error message with line number

# Update an existing job definition
update_job: JOB_EXTRACT_TXN
max_run_alarm: 90   /* change one attribute, rest unchanged */

# Delete a job
delete_job: JOB_OLD_EXTRACT

# Delete a box AND all jobs inside it
delete_box: BOX_OLD_PIPELINE

# Delete a global variable
delete_glob: GVAR_DEPRECATED_FLAG

# Register a machine in AutoSys topology
insert_machine: new-etl-server-01
max_load: 10
factor: 1.00
opsys: LINUX
description: "New ETL processing server"
```
Always run `jil -syntax < script.jil` before applying to production. The syntax checker validates the entire script without touching the database. A single syntax error in a 200-job JIL file will abort the entire import — leaving you with a partially applied definition. Validate first, always.

`update_job` only changes the attributes you specify. All other attributes remain exactly as they were. This is the safe way to change a single attribute on a live job without risk of accidentally resetting other settings. Use `insert_job` only when creating a new job or when you deliberately want to reset all attributes.

Symptom: Only 40 of 60 jobs in a JIL script were created. The other 20 were missing with no error in the AutoSys logs.
Root cause: A syntax error on line 180 caused the jil command to abort mid-import. Jobs defined after line 180 were never loaded.
Fix: Ran jil -syntax < script.jil, found the error (a missing colon in a condition attribute), fixed it, re-ran the full import. Jobs already created by the partial import needed to be manually deleted first.
Prevention: Always validate with -syntax before importing. In CI/CD pipelines, add jil -syntax as a mandatory gate before any JIL deployment.
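A sketch of such a CI gate, assuming `jil` is on PATH as in the examples above; the helper is written as a function so it can be dropped into any pipeline script:

```shell
#!/bin/sh
# Gate a deployment on syntax validation: print PASS/FAIL per file and
# return non-zero on the first failure so the pipeline stops before import.
validate_jil() {
  for f in "$@"; do
    if jil -syntax < "$f" > /dev/null 2>&1; then
      echo "PASS $f"
    else
      echo "FAIL $f"
      return 1
    fi
  done
}

# Typical wiring in CI (uncomment to use):
# validate_jil deploy/*.jil && jil < deploy/payment_eod.jil
```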
AutoSys virtual machines are not VMs in the hypervisor sense — they are logical groups of real agent machines. When a job targets a virtual machine, AutoSys automatically selects which real machine to dispatch it to based on a configurable load-balancing method. This is essential for high-throughput batch environments where hundreds of jobs need to be distributed across a pool of agents.
```
/* Define a virtual machine containing 3 real agents */
insert_machine: VM_ETL_POOL
type: v                      /* v = virtual machine */
machine_method: ROUNDROBIN   /* distribute jobs evenly */
real_machines: etl-proc-01 etl-proc-02 etl-proc-03
description: "ETL processing pool — 3 agents"

/* Job targets the virtual machine — AutoSys picks the real agent */
insert_job: JOB_PROCESS_BATCH
job_type: CMD
command: /opt/etl/process.sh
machine: VM_ETL_POOL   /* targets pool, not specific machine */
max_run_alarm: 30
```
| Method | How It Works | Best For |
|---|---|---|
| ROUNDROBIN | Jobs distributed sequentially across all available agents | Uniform job sizes, simple pools |
| CPU_MON | Job sent to agent with lowest current CPU usage | Mixed workloads with variable CPU demand |
| JOB_LOAD | Uses job_load and max_load attributes to track theoretical load | Jobs with known resource weights |
CPU_MON requires the `rstatd` daemon running on all target machines. If `rstatd` is not running, CPU_MON silently stops balancing as expected, so confirm `rstatd` status before using CPU_MON in production.

The condition attribute supports more status types and patterns than most engineers know. These are the ones that appear in production and in interviews.
```
/* done() — runs if job is in ANY completed state (SUCCESS, FAILURE, TERMINATED)
   Use when you want to proceed regardless of how the upstream job ended */
condition: done(JOB_OPTIONAL_CLEANUP)

/* notrunning() — runs only if the specified job is NOT currently executing
   Use to prevent two jobs running on the same resource simultaneously */
condition: notrunning(JOB_DB_MAINTENANCE)

/* failure() with lookback — trigger if upstream failed recently
   Useful for alerting jobs that should only fire on fresh failures */
condition: failure(JOB_PAYMENT_FEED, 01.00)

/* Lookback using colon syntax — must escape colon with backslash
   Both formats are valid: 01.30 and 01\:30 */
condition: success(JOB_RISK_ENGINE, 01\:30)

/* Combination — complex real-world condition */
condition: success(JOB_EXTRACT, 02.00) AND notrunning(JOB_DB_BACKUP) AND value(GVAR_MARKET_OPEN) = "Y"
```
By default AutoSys marks a job FAILURE if it exits with any non-zero code. The max_exit_success attribute lets you define a threshold — any exit code up to and including that value is treated as SUCCESS. Critical for scripts that use exit codes to signal warnings rather than failures.
```
insert_job: JOB_DATA_VALIDATION
job_type: CMD
command: /opt/validate/run_checks.sh
machine: etl-proc-01
max_exit_success: 4

/* Exit codes 0-4 treated as SUCCESS
   Exit code 0    = all checks passed
   Exit codes 1-4 = warnings (some checks failed but acceptable)
   Exit code 5+   = FAILURE (critical errors) */
```
Resources are named counters that limit how many jobs can run simultaneously against a shared asset — a database, a file system, an API endpoint. Without resources, AutoSys will dispatch as many concurrent jobs as conditions allow, which can overwhelm downstream systems.
```
/* Define a resource — max 3 concurrent DB connections */
insert_resource: DB_CONNECTIONS
quantity: 3
description: "Max concurrent Oracle DW connections"

/* Jobs that consume this resource — each consumes 1 unit */
insert_job: JOB_LOAD_PAYMENTS
job_type: CMD
command: /opt/etl/load_payments.sh
machine: etl-proc-01
resources: DB_CONNECTIONS   /* consumes 1 unit */

insert_job: JOB_LOAD_TRADES
job_type: CMD
command: /opt/etl/load_trades.sh
machine: etl-proc-02
resources: DB_CONNECTIONS   /* waits if 3 already running */

/* A heavy job consuming multiple units */
insert_job: JOB_BULK_LOAD
job_type: CMD
command: /opt/etl/bulk_load.sh
machine: etl-proc-01
resources: DB_CONNECTIONS(2)   /* consumes 2 units — counts as 2 connections */
```
Use `autorep -r DB_CONNECTIONS -s` to check current resource utilisation.

The FTP job type (job_type: FTP) transfers files between servers natively without a wrapper script. AutoSys manages the FTP connection, handles authentication, and reports success or failure based on the transfer result. It replaces fragile shell scripts that call ftp or sftp manually.
```
/* FTP job — download file from remote server */
insert_job: JOB_FTP_GET_PAYMENT
job_type: FTP
box_name: BOX_PAYMENT_EOD
ftp_machine: sftp.partner-bank.com
ftp_user: svc_transfer
ftp_password: %%ENCRYPTED_PASSWORD%%
ftp_src_file: /outgoing/payment_$$GVAR_RUN_DATE.csv
ftp_dest_file: /data/incoming/payment_$$GVAR_RUN_DATE.csv
ftp_dest_dir: /data/incoming
machine: file-drop-01   /* agent that performs the transfer */
description: "Download daily payment file from partner bank"
max_run_alarm: 15

/* FTP job — upload results to remote server */
insert_job: JOB_FTP_PUT_REPORT
job_type: FTP
box_name: BOX_PAYMENT_EOD
ftp_machine: reporting.internal
ftp_user: svc_reports
ftp_src_file: /data/reports/eod_$$GVAR_RUN_DATE.csv
ftp_dest_dir: /reports/incoming
machine: file-drop-01
condition: success(JOB_GENERATE_REPORT)
```
The %%ENCRYPTED_PASSWORD%% pattern uses the AutoSys credential vault.

Calendars in AutoSys define custom sets of dates for job scheduling — business days, trading days, month-end dates, fiscal periods. Instead of hardcoding days_of_week and run_window, calendars let you maintain a central date authority that all jobs reference.
```
/* Standard calendar — specific dates the job SHOULD run */
insert_calendar: CAL_TRADING_DAYS_2026
datetimes: 01/02/2026 01/05/2026 01/06/2026 01/07/2026 01/08/2026 01/09/2026
/* add all trading days */
description: "NYSE trading days 2026"

/* Extended calendar — calculates dates by rule
   last_business_day: runs on last business day of each month */
insert_calendar: CAL_MONTH_END
type: extended
definition: "last_business_day"
description: "Last business day of each month"

/* Exception calendar — dates the job should NOT run
   Use with run_calendar to exclude holidays */
insert_calendar: CAL_HOLIDAYS_2026
datetimes: 01/01/2026 05/25/2026 07/04/2026 12/25/2026

/* Apply calendar to a box job */
insert_job: BOX_TRADING_EOD
job_type: BOX
run_calendar: CAL_TRADING_DAYS_2026
exclude_calendar: CAL_HOLIDAYS_2026
start_times: "18:00"
```
```shell
# List all dates in a calendar — critical for month-end debugging
autorep -c CAL_TRADING_DAYS_2026 -t

# List all defined calendars
autorep -c %

# Check next scheduled run dates for a box
autorep -J BOX_TRADING_EOD -d | grep -i calendar

# Forecast when a job will next run (shows upcoming scheduled dates)
forecast -J BOX_TRADING_EOD -t 30
```
Symptom: Month-end reporting jobs didn't run in March. No failures — jobs show INACTIVE the entire day.
Root cause: The calendar CAL_BUSINESS_DAYS_2026 was built from a template that didn't account for March 31 being a Tuesday (valid business day). A data entry error left it off the datetimes list.
Fix: `autorep -c CAL_BUSINESS_DAYS_2026 -t` confirmed March 31 was missing. Added it with update_calendar JIL, then force-started the box with `sendevent -E FORCE_STARTJOB -J BOX_EOM_REPORTS -D 20260331`.
Prevention: After building any annual calendar, run autorep -c CALENDAR_NAME -t and manually verify the count of dates matches expected business days for the year.
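That verification step can be scripted — a sketch that counts MM/DD/YYYY-shaped dates in a piped calendar listing and compares against an expected total (the listing format varies by version, so adjust the pattern to your instance):

```shell
#!/bin/sh
# Count date-shaped tokens (MM/DD/YYYY) in the listing piped on stdin.
count_calendar_dates() {
  awk '{ for (i = 1; i <= NF; i++)
           if ($i ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]$/) n++ }
       END { print n + 0 }'
}

# Typical wiring (uncomment where autorep is installed; the expected
# count is yours to supply — e.g. trading days in the year):
# [ "$(autorep -c CAL_TRADING_DAYS_2026 -t | count_calendar_dates)" -eq 252 ] \
#   || echo "ALERT: calendar date count mismatch"
```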
One of the most common silent failures in AutoSys: a job has a condition referencing a job name that doesn't exist. The condition silently evaluates to false — the job never runs, no error fires, and the pipeline stalls indefinitely. job_depends is the command that catches this.
```shell
# Check if all condition references in a job are valid
# Reports any job names referenced in conditions that don't exist in AutoSys
job_depends -J JOB_LOAD_TXN

# Check an entire box and all its children
job_depends -J BOX_PAYMENT_EOD

# Check all jobs in the system — run this after any large JIL import
job_depends -J %

# Example output when a dependency is broken:
# JOB_LOAD_TXN: condition job JOB_TRANSFORM_V2_TXN not found in database
# This means JOB_LOAD_TXN will never start — its condition always false
```
Run `job_depends -J %` after every JIL deployment. When jobs are renamed, deleted, or migrated between environments, condition references can become dangling pointers. job_depends catches all of them in one pass. A job with a broken condition will silently never run — there is no error, no alarm, just an ACTIVATED job that waits forever.

Symptom: After a JIL refactor renaming JOB_EXTRACT to JOB_EXTRACT_TXN, 12 downstream jobs stopped running. They showed ACTIVATED indefinitely.
Root cause: The 12 jobs had condition: success(JOB_EXTRACT). The old name no longer existed. Conditions silently evaluated to false.
Fix: job_depends -J % immediately identified all 12 broken references. Updated all conditions to reference JOB_EXTRACT_TXN.
Prevention: Make job_depends -J % part of your deployment runbook. Run it after every JIL change that renames or deletes jobs.
These are the questions that separate AutoSys operators from AutoSys engineers. All are based on real production scenarios.
Q: A box is RUNNING but a child job whose conditions look met won't start. What do you check?
A: Run `autorep -J BOX_NAME -d` to inspect the box's own condition. Also check if any child job is in a state other than SUCCESS — autorep sometimes truncates output. If the box condition references an external job, check that job's status and either fix it or force its status to SUCCESS. If all conditions are genuinely met, try `sendevent -E FORCE_STARTJOB -J BOX_NAME` to re-evaluate.

Q: After you fix and re-run a failed job, do downstream jobs need to be started manually?
A: No — downstream jobs with `condition: success(FAILED_JOB)` will automatically queue once the re-run succeeds.

Q: What does n_retrys do, and what are its limitations?
A: n_retrys tells AutoSys to automatically restart a failed job up to N times before marking it as permanently FAILURE. Combined with retry_interval (minutes between retries), it handles transient failures like network timeouts. The key limitation: retries only apply to FAILURE exit codes, not to TERMINATED status (jobs killed via KILLJOB). Also, if the job fails during a box run and the box completes (because other jobs don't depend on this one), retries stop — the box ending resets all job states.

Q: A job is stuck in STARTING. Walk through your diagnosis.
A: Step 1: Run `autoping -m MACHINE_NAME` — if it fails, the agent is unreachable. Fix the agent first (restart the cybagent service). Step 2: If autoping succeeds, run `autorep -M MACHINE_NAME -r` to confirm the machine alias is registered correctly in the AutoSys topology. Step 3: SSH to the target machine and verify the owner account exists: `id svc_autosys`. If the user doesn't exist, the job stays in STARTING indefinitely with no error. Step 4: Check the owner has execute permission on the script path. Step 5: Check EAS logs for dispatch errors.

Q: What does a lookback condition do, and when do you need one?
A: `success(JOB_MARKET_FEED, 00.30)` means the job only satisfies the condition if it succeeded within the last 30 minutes. Without a lookback, a job that succeeded hours or even days ago still satisfies the condition. Lookbacks are essential for time-sensitive pipelines — market data feeds, regulatory cutoffs, real-time settlement — where stale data from a previous run completing the condition would be dangerous.

Q: How do global variables work, and what is the biggest production risk?
A: Global variables are referenced with the `$$VARNAME` syntax in commands or `value(VARNAME)` in conditions. They're set via `sendevent -E SET_GLOBAL` or defined in JIL with insert_global. The biggest production risk is persistence — global variables retain their value across box runs. If last night's run set GVAR_PAYMENT_FILE to a specific path and tonight's file watcher fails silently, the variable still holds yesterday's path and downstream jobs will process stale data. Always reset critical variables at the start of each pipeline run.

Q: What does the machine attribute actually reference?
A: The machine attribute references the alias registered in the AutoSys topology database, not the server's actual hostname or IP address. These can be identical, but they don't have to be. You can verify registered machine aliases with `autorep -M % -r`. Using the wrong value — the actual hostname when the topology alias is different — leaves the job stuck in STARTING indefinitely because the Event Server cannot resolve the target agent. Always confirm the topology alias before defining a new job for a new machine.

Q: What is the difference between alarm_if_fail and alarm_if_terminated?
A: alarm_if_fail triggers an alarm when a job exits with a non-zero exit code (FAILURE status). alarm_if_terminated triggers an alarm when a job is killed via KILLJOB or terminated by a run-time limit such as term_run_time (TERMINATED status). In production, set both on critical jobs — a job that hangs and gets terminated will not trigger alarm_if_fail because its status is TERMINATED, not FAILURE. Without alarm_if_terminated, a silently hung-and-killed job can go unnoticed.

Q: Explain the permission attribute. Why are wx and we dangerous?
A: The permission attribute controls who can execute (x) and edit (e) a job. The prefixes are g (group), w (world), and m (me/owner). wx means any user in the system can force-start the job; we means any user can modify the JIL definition. In enterprise environments, wx and we are serious security risks — any developer or operator account could accidentally or maliciously modify or trigger a production settlement job. Best practice is gx,ge only, mapping the group to a controlled AD/LDAP group via EEM.

Q: How do you call a REST API from an AutoSys job?
A: Use the Web Services job type (job_type: WS), which is native in AutoSys r12.x+. Define web_svc_url, web_svc_method (GET/POST), web_svc_body for the request payload, and web_svc_success_codes for the HTTP codes that constitute success (typically 200,201,202). The job succeeds or fails based on the response code — no wrapper script needed. Set max_run_alarm to handle hung connections. Global variables can be interpolated into the URL or body using the `$$GVAR_NAME` syntax.

Q: Month-end jobs didn't run and there were no failures. What are the likely causes?
A: A common cause: run_calendar: BUSINESS_DAYS is used and the last day of the month falls on a Saturday, so the box has no trigger date for that run. Other causes: the calendar was updated and the change wasn't applied to all environments; the box has a days_of_week restriction that conflicts with the calendar; or the Event Server was down during the scheduled window and the missed run wasn't caught up. Diagnose with `autorep -c CALENDAR_NAME -t` to see all scheduled dates.

Q: What is the difference between std_out_file and the job log?
A: std_out_file captures the script's stdout on the remote agent machine — whatever the shell script prints to standard output. The job log in the AutoSys database captures the job's lifecycle events: when it was dispatched, which agent received it, the exit code, start and end timestamps. Both are essential for debugging: the job log tells you what AutoSys did, the std_out_file tells you what the script did. If std_out_file is not set, stdout is lost when the job completes. Use $DATE in the filename to preserve one log per run rather than overwriting.

Q: How do you export a job definition to move it between environments?
A: Use `autorep -J JOB_NAME -q` — the -q flag outputs the job definition in JIL format that can be piped directly to the jil command on another instance. This is the standard way to copy jobs between environments (dev → test → prod), create backups before making changes, or document existing job definitions. For a full box including all children: `autorep -J BOX_NAME -q` exports the box and every job inside it.

Q: A job gated on a global-variable condition won't start. How do you debug it?
A: Run `autorep -G GVAR_NAME` to see the current value of the variable. If it's empty or set to an unexpected value, that's why the condition fails. Also run `autorep -J JOB_NAME -d` to inspect the job's condition attribute and confirm exactly what value the condition is checking. Common scenario: a job has `condition: value(GVAR_RUN_DATE) != ""` but the variable was never set because an earlier job that calls SET_GLOBAL failed. Fix by manually setting: `sendevent -E SET_GLOBAL -G GVAR_RUN_DATE -V "20260417"`, then force-starting the blocked job.

Q: What does min_file_size do on a file watcher, and why does it matter?
A: min_file_size sets the minimum file size in bytes that the watched file must reach before the FW job considers it a success. Setting it to 1 prevents the job from triggering on an empty file — a common failure mode in file-based pipelines where the upstream system creates the file immediately but writes data to it over time. Without min_file_size, the FW job succeeds the instant the file appears (even with 0 bytes), the downstream CMD job starts, and attempts to process an empty file. This causes subtle failures that are hard to diagnose because the file exists but contains no data.

Q: How do you force-start a box for a past date?
A: Use the -D flag with sendevent: `sendevent -E FORCE_STARTJOB -J BOX_NAME -D YYYYMMDD`. This triggers the box as if it were running on the specified date, which is critical for date-aware jobs that use $$DATE or date-based global variables. Without the -D flag, FORCE_STARTJOB runs the box with today's date, which would cause the jobs to process the wrong data set. Always confirm the date format matches your AutoSys configuration — some environments use MMDDYYYY.

Q: JIL permissions vs EEM — when do you use which?
A: JIL permissions (permission: gx,ge) are per-job access controls defined in the job definition itself — they control who can execute or edit that specific job based on OS group membership. EEM (Embedded Entitlements Manager) is the centralized RBAC layer that maps roles (Operator, Developer, Admin) to LDAP/AD groups across all jobs in the instance. EEM supersedes JIL permissions in modern AutoSys deployments — it allows consistent access control without touching individual job definitions. Use EEM when you need enterprise-wide role enforcement; use JIL permissions as a secondary layer for job-specific restrictions.