Beginner 4 min · March 19, 2026

AutoSys Architecture and Components

AutoSys Architecture — Full /var Crashes Agents Silently

Q: What database does AutoSys use for the Event Server?

AutoSys traditionally used Sybase as its backend database. Newer versions also support Oracle. The Event Server stores all job definitions, events, and system state.

Q: What happens if the Event Processor crashes?

If the Event Processor goes down, no new jobs will be triggered. Jobs that are already running will continue until they complete. AutoSys supports a shadow scheduler that can take over if the primary Event Processor fails, providing high availability.

Q: Can I run AutoSys jobs on Windows machines?

Yes. AutoSys Remote Agents are available for both Unix/Linux and Windows. You can schedule jobs to run on Windows machines the same way as Unix machines, though some command-line tools like eventor are Unix-only.

Q: What is the difference between the Event Server and the Event Processor?

The Event Server is the database that stores all data. The Event Processor is the daemon that reads from the Event Server, evaluates job conditions, and triggers jobs. One is data storage; the other is the decision engine.

Q: How do I recover from PEND_MACH status?

First, check the agent machine: disk space (`df -h`), the agent process (`ps -ef | grep autosys_agent`), and the agent log (`/var/log/autosys/agent.log`). Restart the agent if needed: `/etc/init.d/autosys_agent restart`. Then check whether the stuck jobs have resumed. For any job still in PEND_MACH, force start it: `sendevent -E FORCE_STARTJOB -J job_name`. A forced job runs immediately; jobs that do not resume on their own stay in PEND_MACH until they are forced.

A full /var disk crashed the Remote Agent silently, leaving 500 jobs in PEND_MACH.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

✓ Production

production tested

July 27, 2026

last updated

1,750

articles · all by Naren

Before you start⏱ 20 min

✓Basic programming fundamentals
✓A computer with internet access
✓Willingness to follow along with examples

⚡Quick Answer

AutoSys components: Event Server (database of all jobs/events), Event Processor (scheduler daemon), Remote Agent (job executor), client tools (jil, autorep, sendevent)
Event Server stores all job definitions, event history, machine definitions; source of truth for entire AutoSys environment
Event Processor runs continuously, evaluates conditions, triggers jobs on Remote Agents — only ONE per AutoSys instance
Performance: the Event Processor polls the Event Server at an interval (about 30 seconds by default), so allow for that delay when you design near-real-time alerts
Production trap: Remote Agent machine runs out of disk space — agent crashes, all jobs on that machine go PEND_MACH (stuck), no auto-recovery
Biggest mistake: Running multiple Event Processors — corrupts job state, leads to duplicate job execution

What is AutoSys Architecture and Components?

AutoSys is a distributed job scheduling and workload automation platform from CA Technologies (now Broadcom), used by enterprises to orchestrate batch jobs, scripts, and workflows across thousands of servers. It solves the problem of coordinating time- and event-driven tasks at scale—think nightly ETL pipelines, report generation, or system maintenance—without manual intervention.

The architecture is fundamentally client-server, with a central Event Server (a database) storing all job definitions and state, an Event Processor (the 'brain') polling that database and dispatching commands, and Remote Agents running on target machines to execute the actual work. Client tools like autorep and sendevent let operators query and control jobs from the command line or GUI.

Where AutoSys shines is in environments requiring high reliability and auditability—banks, telecoms, and healthcare rely on it for SLA-bound processing. But its age shows: the architecture assumes reliable network and disk I/O, and a full /var filesystem on an agent host can silently kill the agent process without logging to the Event Server.

The agent's watchdog script may fail to restart if disk is 100% full, leaving jobs stuck in 'RUNNING' state indefinitely. Alternatives like Control-M, Airflow, or Prefect offer container-native scheduling or DAG-based workflows, but AutoSys remains entrenched in legacy mainframe-adjacent shops due to its mature event-driven model and integration with CA7.

You'll hit this /var failure mode when the agent's log directory ($AUTOUSER/log) or spool files fill the partition. The agent binary (wagent) writes heartbeat and job output to disk; when writes fail, it exits silently—no alert to the Event Processor.

The fix involves monitoring disk usage on agent hosts, separating /var from /opt/CA/WA_AGENT, and configuring ulimit or log rotation. Understanding this architectural fragility is key: the Event Server is the single source of truth, but it's blind to agent-side disk failures unless you add external monitoring.

Plain-English First

Think of AutoSys like a restaurant. The Event Server is the order book — it stores every job definition and event. The Event Processor is the head chef — it reads the orders and decides what to cook next. The Remote Agents are the kitchen staff on different floors — they actually execute the work. The GUI is the front-of-house — you see what's happening and can make changes.

Before you write a single line of JIL or schedule your first job, it helps to understand how AutoSys actually works under the hood. The architecture is straightforward but knowing what each component does — and why — will save you a lot of head-scratching when things go wrong in production.

AutoSys has four major components that work together: the Event Server, the Event Processor, Remote Agents, and client tools. Each has a clear job, and understanding the flow between them makes debugging much easier.

By the end you'll know exactly how job definitions flow from JIL to Event Server to Event Processor to Remote Agent and back. You'll understand what PEND_MACH means and why it's the most common production issue. And you'll know the component that, when it fails, stops all job scheduling.

Why AutoSys Agents Fail Silently on Full /var

AutoSys is a distributed job scheduling system in which a central Event Processor (the scheduler) communicates with Remote Agents running on target machines. The Event Processor dispatches work to an agent; the agent executes the command and reports status, writing its logs under /var/log/autosys. When /var fills up, the agent can no longer write logs or status updates. It does not exit with a clear error. It simply stops reporting. This silent failure is one of the most common causes of "lost" jobs in production.

In practice, agents use a heartbeat mechanism (60-second interval by default) to signal liveness. A full /var prevents heartbeat writes, so the Event Processor marks the agent as down after 3 consecutive missed heartbeats. Until that threshold is reached, the scheduler assumes the agent is alive and keeps dispatching jobs to it; afterwards the agent shows as 'OFFLINE' or 'UNREACHABLE' in the GUI. The agent process itself may still be running and simply unable to communicate. The result is a zombie state: the agent looks active on the host (ps shows it), but the scheduler sees it as dead.

Monitoring /var usage is not optional; a threshold of 85% should trigger alerts. Use this architecture when you need centralized control over thousands of jobs across heterogeneous servers. The silent failure mode matters because it breaks the fundamental contract of distributed scheduling: reliable status reporting. Without disk space monitoring, teams waste hours debugging phantom network issues.
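The 85% threshold can be checked with a few lines of shell. The following is a minimal sketch, not an AutoSys feature: the function names are made up for illustration, the mount point and threshold are parameters, and it assumes the POSIX `df -P` layout where the capacity percentage is the fifth column.

```shell
#!/usr/bin/env bash
# Sketch: classify filesystem usage on an agent host.
# Function names are illustrative; 85 is the alert level named in the text.

# Read `df -P <mount>` output on stdin and print the integer use percentage.
# `df -P` prints one header line, then one line per filesystem with the
# capacity (for example "91%") in column 5.
usage_pct_from_df() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print OK or ALERT for a usage percentage ($1) against a threshold
# ($2, default 85).
classify_usage() {
    local pct threshold
    pct=$1
    threshold=${2:-85}
    if [ "$pct" -ge "$threshold" ]; then
        echo "ALERT"
    else
        echo "OK"
    fi
}

# Check one mount point (default /var) and print its state on one line.
check_mount() {
    local mount_point pct
    mount_point=${1:-/var}
    pct=$(df -P "$mount_point" | usage_pct_from_df)
    echo "$mount_point ${pct}% $(classify_usage "$pct" "${2:-85}")"
}
```

Calling `check_mount /var 85` from cron or a monitoring agent prints one line per run. The check runs outside the agent on purpose, so it keeps working when the agent itself cannot write to disk.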

⚠ Silent Zombie Agents

An agent with a full /var does not die — it goes mute. The process stays alive, so process monitors (e.g., monit, systemd) see nothing wrong.

📊 Production Insight

Real scenario: A data pipeline's nightly batch jobs stopped running. The scheduler showed agents as ONLINE, but no jobs completed. Root cause: /var filled by old AutoSys logs (default retention: 30 days). Symptom: agents reported 'disk full' in their own logs but never surfaced to the scheduler. Rule: Always mount /var/autosys on a separate partition with a 5GB minimum and set log rotation to 7 days.

🎯 Key Takeaway

AutoSys agents fail silently when /var is full — they stop reporting but stay alive.

Monitor /var usage at 85% and set separate partitions for agent logs.

Heartbeat timeouts (3 missed = agent down) are your only signal — don't rely on process checks.

The Event Server — the source of truth

The Event Server is a relational database (typically Sybase or Oracle) that stores everything AutoSys needs to operate. This includes all job definitions (what to run, when, where, under which conditions), all events that have occurred (job started, job succeeded, job failed), global variable values, machine definitions, calendar definitions, and monitor and report definitions.

When a job finishes and reports its status, that status goes into the Event Server. When the Event Processor needs to know whether a dependent job's condition is met, it queries the Event Server. It's the single source of truth for the entire AutoSys environment.

🔥High availability with dual event servers

AutoSys supports a primary/shadow Event Server configuration. If the primary goes down, the shadow takes over automatically — no manual intervention needed. This is critical for environments that can't afford job scheduling downtime.

📊 Production Insight

The Event Server is the single point of failure for all AutoSys metadata. If it goes down, no job definitions can be read, no status updates can be written.

Dual Event Servers (primary/shadow) provide failover without scheduling downtime. This guide describes the switch as automatic and heartbeat-driven; confirm the behavior on your release, since older setups may need a manual step.

Rule: Monitor Event Server CPU, disk I/O, and table sizes. A very large event table (hundreds of GB) without suitable indexes can add tens of seconds to each query, stalling all job scheduling.

🎯 Key Takeaway

Event Server database stores everything — job definitions, event history, machine definitions, calendars. It is the source of truth.

Dual Event Servers provide high availability. Always configure shadow server for critical environments.

Rule: Purge old event history regularly (db_purge_events) to keep query performance acceptable.

The Event Processor — the brain

The Event Processor (also called the scheduler or the event daemon) is the most important component. It runs continuously, polling the Event Server for events. When it detects that a job's starting conditions are met — the right time has arrived, dependent jobs have succeeded, the machine is available — it triggers the job to run on the appropriate agent.

The Event Processor also handles time-based scheduling, evaluates job condition logic, and manages the overall state machine for each job. On Unix/Linux it's started with the eventor command. There is only ever one Event Processor running per AutoSys instance.

io/thecodeforge/autosys/start_event_processor.sh · BASH

# Start the AutoSys Event Processor (UNIX only)
eventor

# Check connectivity to an agent machine (the machine name is an example)
autoping -m prod-server-01

# Check AutoSys flags and system status
autoflags -a

📊 Production Insight

The Event Processor runs exactly once per AutoSys instance. A second instance causes duplicate job runs and state corruption.

The Event Processor polls the Event Server at configurable intervals (default 30 seconds). This means job conditions are not evaluated in real time.

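Rule: Never run two Event Processors. Use flock in startup scripts. Monitor the eventor process count from cron, and keep grep from counting itself: if [ "$(ps -ef | grep '[e]ventor' | wc -l)" -ne 1 ]; then alert; fi (alert stands for whatever notification command you use).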
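The process-count rule above can be wrapped in a small script. This is a sketch under assumptions: the process name `eventor` is taken from the text and may differ on your installation, and the function names are illustrative. It uses `pgrep -x`, which matches the exact process name and never counts itself.

```shell
#!/usr/bin/env bash
# Sketch: verify that exactly one Event Processor process is running.
# PROC_NAME is an assumption; the text refers to the process as "eventor".
PROC_NAME=${PROC_NAME:-eventor}

# Count processes whose name matches exactly. pgrep excludes itself,
# which avoids the classic `ps | grep` off-by-one.
count_procs() {
    pgrep -x "$1" 2>/dev/null | wc -l | tr -d ' '
}

# Map a process count to a state: 0 = DOWN, 1 = OK, more = DUPLICATE.
classify_count() {
    case "$1" in
        0) echo "DOWN" ;;
        1) echo "OK" ;;
        *) echo "DUPLICATE" ;;
    esac
}

# Print the state; return non-zero unless exactly one instance is running.
check_event_processor() {
    local n state
    n=$(count_procs "$PROC_NAME")
    state=$(classify_count "$n")
    echo "$PROC_NAME instances=$n state=$state"
    [ "$state" = "OK" ]
}
```

A cron entry can run `check_event_processor || <your alert command>`. Separating DOWN from DUPLICATE matters because the two states call for opposite actions: start the scheduler in one case, stop the extra process in the other.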

🎯 Key Takeaway

Event Processor is the decision engine — evaluates conditions, triggers jobs on Remote Agents.

Only ONE Event Processor per AutoSys instance. Duplicate eventors cause duplicate job execution.

Rule: Monitor Event Processor uptime. If it dies, no new jobs start. Running jobs continue to completion.

Remote Agents — where the work actually happens

Remote Agents run on every machine where AutoSys needs to execute jobs. When the Event Processor decides a job should run, it sends a message to the Remote Agent on the target machine. The agent starts the process, monitors it, captures the exit code, and reports the result back to the Event Server.

Agents can be extended with plugins for specific integrations — SAP, Oracle E-Business, PeopleSoft, and others. If an agent goes down, jobs that are supposed to run on that machine go into PEND_MACH status and wait until the agent comes back up.

⚠ PEND_MACH is one of the most common production issues

If a machine's filesystem fills up, the agent service crashes and all jobs on that machine go PEND_MACH. This is a very common production incident — always monitor disk space on agent machines.

📊 Production Insight

The Remote Agent is a lightweight process, but it can crash silently. No alarm is raised by default.

When an agent goes down, jobs already running continue but cannot report completion, and new jobs cannot start.

Rule: Monitor agent heartbeat via Event Processor. If agent unreachable for >5 minutes, send alert. Monitor disk space, memory, and agent process existence on each agent machine.
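One way to implement the 5-minute rule outside AutoSys is to record the time of the last successful probe per machine and alert when that record goes stale. The sketch below assumes a probe command supplied by the caller; `STATE_DIR` and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: alert when an agent has been unreachable for more than 5 minutes.
# STATE_DIR is a hypothetical location for per-machine timestamps.
STATE_DIR=${STATE_DIR:-/tmp/agent_contact}
MAX_SILENCE=${MAX_SILENCE:-300}   # 5 minutes, as in the rule above

# Run the probe for a machine; on success store the current epoch time.
# Usage: record_contact <machine> <probe command...>
record_contact() {
    local machine
    machine=$1
    shift
    mkdir -p "$STATE_DIR"
    if "$@" >/dev/null 2>&1; then
        date +%s > "$STATE_DIR/$machine"
    fi
}

# Return 0 (stale) when the last contact is older than MAX_SILENCE seconds
# or was never recorded. $2 optionally overrides "now" (epoch seconds).
is_stale() {
    local machine now last
    machine=$1
    now=${2:-$(date +%s)}
    [ -f "$STATE_DIR/$machine" ] || return 0
    last=$(cat "$STATE_DIR/$machine")
    [ $(( now - last )) -gt "$MAX_SILENCE" ]
}
```

A once-a-minute cron job could run `record_contact prod-db-01 autoping -m prod-db-01` and then `is_stale prod-db-01 && <your alert command>`. A machine that has never answered counts as stale, so a brand-new host with a broken agent is not silently ignored.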

🎯 Key Takeaway

Remote Agents execute jobs on target machines and report results back to Event Server.

PEND_MACH status means the agent is unreachable. Jobs can stay in PEND_MACH even after the agent recovers; check their status and force start any that remain stuck.

Rule: Monitor agent disk space and process health. An agent crash is silent without external monitoring.

Client tools — how you interact with AutoSys

Client tools are the interfaces you use to define, manage, and monitor jobs. The main ones are:

jil: the command-line JIL processor for creating and modifying job definitions
autorep: reports job status and definitions
sendevent: manually triggers events like starting a job or putting it on hold
autostatus: checks the current status of a specific job
WCC Web UI: a browser-based dashboard for monitoring job flows visually

Most experienced AutoSys administrators work primarily from the command line using jil, autorep, and sendevent. The GUI is useful for monitoring and for people less comfortable with CLI.

io/thecodeforge/autosys/autosys_client_commands.sh · BASH

# Check status of a specific job
autostatus -J daily_report

# Get a detailed report on a job
autorep -J daily_report -d

# List all jobs in a box
autorep -J box_name%

# Check machine status
autorep -M prod-server-01

📊 Production Insight

Client tools connect directly to the Event Server. No intermediate services required.

jil and autorep are essential for scripting and automation. The GUI is convenient but adds no functionality.

Rule: Write scripts using autorep -J % -d for monitoring, sendevent -E FORCE_STARTJOB for recovery. Avoid manual GUI actions in automated recovery procedures.
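A scripted recovery along these lines might first list the affected jobs and only print the commands it would run. This sketch rests on an assumption about the report layout: each job line starts with the job name and carries a two-letter status code as its own field (for example PE for PEND_MACH). Verify that against the autorep output on your release before relying on it. The function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: list jobs in a given status from an autorep summary report and
# print (not run) the matching sendevent commands.
# Assumption: job lines start with the job name and contain the status
# code as a separate whitespace-delimited field.

# Read report text on stdin; print names of jobs whose line has a field
# equal to the status code in $1.
jobs_in_status() {
    awk -v st="$1" '
        NF < 2 { next }
        ($1 == "Job" && $2 == "Name") || $1 ~ /^_+$/ { next }
        { for (i = 2; i <= NF; i++) if ($i == st) { print $1; break } }
    '
}

# Read job names on stdin; print one FORCE_STARTJOB command per job.
force_start_commands() {
    local job
    while IFS= read -r job; do
        if [ -n "$job" ]; then
            printf 'sendevent -E FORCE_STARTJOB -J %s\n' "$job"
        fi
    done
}
```

Typical use is `autorep -J % | jobs_in_status PE | force_start_commands`, reviewing the printed commands before piping them to `sh`. Scanning every field for the status code, rather than picking a fixed column, keeps the parser working when the start or end time is missing and the columns shift.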

🎯 Key Takeaway

jil defines jobs, autorep reports status, sendevent triggers manual events, WCC GUI provides visual monitoring.

CLI tools are scriptable; use them in automation. GUI is fine for ad-hoc monitoring.

Rule: For production support, master the CLI tools. The GUI is not available in an SSH session.

The Scheduler — Where Time Becomes a Trigger

You think a cron job is reliable? AutoSys Scheduler doesn't think so. It's the component that turns a wall-clock moment into a job start condition. But here's the rub: it's not a clock, it's a state machine.

The Scheduler runs inside the Event Processor. It doesn't fire jobs. It creates STARTING events. Those events get queued into the Event Server. If the Event Server can't accept the write — full disk, slow I/O, network partition — that STARTING event vaporizes. No retry. No log. Your job simply never runs.

Second trap: the Scheduler uses the system timezone of the machine hosting the Event Processor. If you migrate the Event Processor to a new host and forget to sync timezone, every scheduled job shifts by hours. The Event Server stores the UTC timestamp, but the comparison logic runs in local time. This is not daylight saving ignorance. This is production downtime.

Why this matters: when you see a job that should have fired at 02:00 but didn't, don't immediately blame the Agent. Check Scheduler health, check timezone, check if the Event Server was accepting writes at that second. Time-based triggers are only as reliable as the full write-path to the Event Server.

TimezoneDrift.yml · shell session (output is illustrative and varies by release)

# io.thecodeforge: devops tutorial

# Check what time AutoSys thinks it is
USER> autostatus -A | grep -i scheduler
SCHEDULER: Running on host 'prod-ep-01', PID 7723
   Current Local Time: 2025-03-18 14:22:35 EDT
   Event Processor TZ: America/New_York
   Event Server UTC:   2025-03-18 18:22:35

# Local time and UTC must describe the same instant (14:22:35 EDT is 18:22:35 UTC).
# If they are more than 1 second apart, you have a problem.
# Compare with: date +%s on the Event Server host
USER> date +%s; echo "---"; sqlplus -S autosys/autosys@EVENTDB <<< "SELECT TO_CHAR(SYSTIMESTAMP, 'YYYY-MM-DD HH24:MI:SS TZR') FROM DUAL;"
1742322155
---
2025-03-18 14:22:35 EDT

# Off by 0 seconds: good. Off by 3600? You just lost an hour of schedules.

Output

Check local time and Event Server DB time. If mismatch >1 second, fix timezone config or NTP. Job missed? That's why.
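The comparison is easier to automate with epoch seconds than with wall-clock strings, because epoch time does not depend on timezone. The sketch below covers two separate checks: real clock drift between two hosts, and a record of this host's UTC offset so a timezone change shows up in a before/after diff. Function names are illustrative, and how you obtain the second timestamp is left to you.

```shell
#!/usr/bin/env bash
# Sketch: compare two clocks by epoch seconds instead of wall-clock text.
# 14:22:35 EDT and 18:22:35 UTC are the same epoch value, so they compare
# as equal here.

# Print the absolute difference between two epoch timestamps.
abs_drift() {
    local d
    d=$(( $1 - $2 ))
    if [ "$d" -lt 0 ]; then
        d=$(( -d ))
    fi
    echo "$d"
}

# Print OK or DRIFT for a difference ($1) against a tolerance in seconds
# ($2, default 1).
classify_drift() {
    if [ "$1" -gt "${2:-1}" ]; then
        echo "DRIFT"
    else
        echo "OK"
    fi
}

# Print this host's epoch time, UTC offset and zone name. Save the output
# before and after a migration or timezone change and diff the two.
local_clock_report() {
    echo "epoch=$(date +%s) offset=$(date +%z) zone=$(date +%Z)"
}
```

For example, `classify_drift "$(abs_drift "$(date +%s)" "$remote_epoch")"`, where `remote_epoch` holds the output of `date +%s` on the other host. An epoch comparison will not reveal a timezone misconfiguration on its own; that is what the offset in `local_clock_report` is for.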

⚠ Production Trap:

Never assume the Scheduler uses UTC internally. It doesn't. It uses the host OS timezone. When you change the Event Processor's timezone for any reason, every single scheduled job shifts. Always validate with autostatus -A before and after.

🎯 Key Takeaway

Jobs don't fire from time. They fire from a STARTING event written to the Event Server. No write, no run. Timezone is a binary kill-switch.

The Dependency Graph Without the Graph — Why Job A Doesn't Care About Job B

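You set 'condition: s(jobB)' on job A. You think: when job B finishes, job A starts. Not quite. AutoSys doesn't maintain a live dependency tree that it watches continuously. It evaluates a condition only when an event gives it a reason to: when job A is submitted, and again when the status of a job it references changes.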

Here's how it actually works: job A is submitted to the Event Server with a condition. The Event Processor sees the condition, looks up the current status of job B from the Event Server. If job B is SUCCESS, the Event Processor creates a STARTING event for job A immediately. If job B is still RUNNING, the Event Processor does nothing. It does NOT poll. It does NOT watch. The condition sits dead until job B finishes and its status changes.

When job B finishes, its status change is what wakes job A up: the change triggers the Event Processor to re-evaluate ALL conditions that reference job B. This is the "status-change cascade." Every finished job forces a condition re-evaluation across every job that depends on it. If you have 10,000 jobs depending on one master job, that master job finishing will spawn 10,000 condition checks in a single CPU-bound loop. On a busy Event Processor, this chokes the scheduler for minutes.

Why this hurts: we once had a 45-minute gap between a master job finishing and the first dependent job starting. The Event Processor was stuck in a condition re-evaluation loop. The master job's status was SUCCESS. The dependent jobs had their conditions met. But the Event Processor couldn't process the next STARTING event until it finished evaluating all 8,000 dependents. No parallelism. No queue priority. Just a single-threaded condition scan.

Design for this. Batch dependents. Use box jobs to group dependencies. Never put 8,000 jobs on the same condition.

DependencyStorm.yml · JIL sketch (command and machine attributes omitted for brevity)

/* io.thecodeforge: devops tutorial */

/* Before: 8,000 dependents on one master (bad) */
insert_job: daily_payment_export
job_type: CMD
condition: s(master_report)

/* After: group via a box so that only the box carries the condition */
insert_job: payment_pipeline
job_type: BOX
condition: s(master_report)

insert_job: payment_export_region_a
job_type: CMD
box_name: payment_pipeline

insert_job: payment_export_region_b
job_type: CMD
box_name: payment_pipeline

/* master_report succeeds -> the box starts -> the jobs inside the box start with it */
/* 10 conditions evaluated instead of 8,000 */

Output

Reducing dependent count from 8,000 to 10 dropped condition evaluation time from 45 minutes to 2 seconds. No code change — just structural grouping.
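To find parents with a large fan-out before they cause a storm, you can count how often each job name appears in condition lines. This is a rough sketch: it assumes job definitions were exported as JIL text (for example with `autorep -J % -q`) and that conditions are written in the `s(job)`, `f(job)`, `d(job)` style. Global-variable conditions such as `v(name)` are counted too, so read the result as an upper bound. The function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: count how many condition references point at each job.
# Assumption: input is JIL text with one "condition:" attribute per line.

# Read JIL text on stdin; print "<count> <name>" sorted by count, highest
# first. Look-back arguments such as "s(job, 12.00)" are ignored.
dependents_per_parent() {
    grep -E '^[[:space:]]*condition:' \
        | grep -oE '[A-Za-z]\([A-Za-z0-9_.#-]+' \
        | sed -E 's/^[A-Za-z]\(//' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}
```

Running `autorep -J % -q | dependents_per_parent | head` shows the ten most referenced names. Anything with thousands of references is a candidate for grouping under a box.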

💡Senior Shortcut:

Use 'autosyslog -e' to see how long a condition re-evaluation takes. Run it right after the parent job finishes. If you see a gap between parent SUCCESS and child STARTING longer than 5 seconds, you have a dependency storm. Cut dependents or introduce box jobs.

🎯 Key Takeaway

AutoSys does not pre-validate conditions. It only checks when the parent's status changes. That check is single-threaded. Group dependencies into boxes to keep the evaluation loop fast.

● Production incident · POST-MORTEM · severity: high

The Full Disk That Froze 500 Jobs

Symptom

AutoSys Web UI showed 500 jobs in PEND_MACH status. autorep -J % -d showed jobs waiting for machine 'prod-db-01'. Machine was still running, but AutoSys agent was not responding. No CPU spike, no memory pressure, no network issues. The only symptom was a full /var filesystem.

Assumption

The team assumed the machine was healthy because it was pingable and SSH worked. They didn't know that the Remote Agent writes log files to /var and crashes silently when disk space runs out. They also had no monitoring on agent disk usage.

Root cause

The Remote Agent writes logs to /var/log/autosys/agent.log by default. A misconfigured application job generated 50GB of debug output overnight, filling /var. When the disk reached 100%, the agent service tried to write to its log, failed, and crashed. The Event Processor sent a heartbeat check to the agent, got no response, and marked all jobs scheduled for that machine as PEND_MACH (pending machine). No new jobs could start on that machine. Jobs that were already running kept running, but with the agent down their completion could not be reported, so they stayed in RUNNING state until a human intervened.

Fix

1. Added disk space monitoring for all agent machines: alert when /var is more than 80% full.
2. Configured log rotation for agent logs: logrotate with compression and 7-day retention.
3. Set disk_check_interval in the agent config to check free space before writing.
4. For the offending job, limited log output to 100MB and added log rotation in the script.
5. Added a cron job that restarts the AutoSys agent if it is down: if ! pgrep -x autosys_agent >/dev/null; then /etc/init.d/autosys_agent start; fi. Use pgrep rather than ps -ef | grep -q here: grep matches its own command line, so that check never reports the agent as down.
6. Documented PEND_MACH resolution steps: check disk space, restart agent, then sendevent -E FORCE_STARTJOB for stuck jobs.
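The restart-if-down step can be sketched as a small watchdog that takes the liveness probe and the restart action as separate functions. The process name and init script path below reuse the names from this write-up and may differ on your installation; the function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: restart the agent only when a liveness probe fails.
# AGENT_PROC and AGENT_INIT reuse the names from the text; adjust them to
# whatever your installation actually uses.
AGENT_PROC=${AGENT_PROC:-autosys_agent}
AGENT_INIT=${AGENT_INIT:-/etc/init.d/autosys_agent}

# Liveness probe: exact process-name match, so it never matches itself.
agent_is_running() {
    pgrep -x "$AGENT_PROC" >/dev/null 2>&1
}

# Restart action.
restart_agent() {
    "$AGENT_INIT" start
}

# Run the probe ($1); if it fails, log to stderr and run the action ($2).
watchdog() {
    if "$1"; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') probe '$1' failed, running '$2'" >&2
    "$2"
}
```

From cron, source the file and call `watchdog agent_is_running restart_agent`. Keeping the probe and the action separate makes it easy to swap the process check for a stricter probe later. A restart will not help while the disk is still full, so pair this with the disk alert rather than treating it as a replacement.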

Key lesson

Remote Agent crashes are silent. The agent stops without raising an alarm. Jobs go PEND_MACH without notification.
Always monitor disk space on agent machines. An 80% full alert is your early warning. At 95%, trigger immediate action.
Do not count on PEND_MACH resolving by itself. In this incident, jobs remained in PEND_MACH after the agent restarted until they were manually forced.
The Event Processor cannot distinguish between a crashed agent and a slow agent. It just waits for heartbeat timeout.

Production debug guide

Symptom → Action mapping for common AutoSys architecture failures. 5 entries

Symptom · 01

All jobs on a specific machine stuck in PEND_MACH — no new jobs start

→

Fix

Remote Agent likely crashed. Check if agent process is running: ps -ef | grep autosys_agent. Check disk space on agent machine: df -h /var. Check agent log: /var/log/autosys/agent.log. Restart agent: /etc/init.d/autosys_agent restart. Then force restart stuck jobs: sendevent -E FORCE_STARTJOB -J job_name.

Symptom · 02

No new jobs start anywhere — all jobs stuck regardless of machine

→

Fix

Event Processor may be down. Check Event Processor status: autoping. If down, restart: eventor. Also check Event Server database connectivity: isql -S autosys (Sybase). Check if Event Processor process exists: ps -ef | grep eventor.

Symptom · 03

Job status inconsistent — job shows SUCCESS but child didn't run

→

Fix

Events may be missing from Event Server. Check Event Processor log: $AUTOSYS/log/event_processor.log. Look for 'lost event' or 'queue overflow'. Increase Event Server buffer size or purge old events.

Symptom · 04

Duplicate jobs executed — same job runs twice at same time

→

Fix

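Two Event Processor instances are running. Check: ps -ef | grep '[e]ventor' | wc -l should print 1 (the bracket keeps grep from counting itself). If it prints more than 1, kill the duplicate processes. Prevent a repeat by starting eventor under flock with a lock file.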

Symptom · 05

Job status updates delayed by hours — finished job still shows RUNNING

→

Fix

Remote Agent network latency or Event Server overload. Check Event Processor polling interval (default 30 seconds). For high-volume environments, increase Event Processor threads: max_threads in configuration.

★ AutoSys Component Debug Cheat Sheet

Fast diagnostics for AutoSys architecture issues in production environments.

Jobs stuck in PEND_MACH: all jobs on one machine

Immediate action

Check Remote Agent status and disk space

Commands

ps -ef | grep -i autosys_agent

df -h /var /tmp /opt

Fix now

Restart agent: /etc/init.d/autosys_agent restart. Then force start jobs:

for job in $(autorep -J % -m MACHINE_NAME -d | grep PEND_MACH | awk '{print $1}'); do sendevent -E FORCE_STARTJOB -J "$job"; done

AutoSys Components Comparison

Component	Type	Runs On	Key Responsibility	Failure Impact	How to Monitor
Event Server	Database (Sybase/Oracle)	Dedicated server	Stores all job definitions, events, state	Catastrophic — no job definitions can be read, no status updates	Check database connectivity, table sizes, I/O latency, dual server sync
Event Processor	Daemon/Service	AutoSys server	Evaluates conditions, triggers jobs	Severe — no new jobs start, running jobs continue	autoping, ps -ef \| grep eventor, check log for errors
Remote Agent	Service	Every target machine	Executes jobs, reports results	Local — jobs on that machine go PEND_MACH, other machines unaffected	Check process exists, disk space, network connectivity to Event Server
jil	CLI client	Any client machine	Define/modify job definitions	None (if jil fails, use another client)	N/A (tool exits with non-zero code on error)
autorep	CLI client	Any client machine	Report job status and definitions	None (use another client)	N/A
sendevent	CLI client	Any client machine	Manually trigger events (START, HOLD, etc.)	None (use another client)	N/A
WCC (Web UI)	Browser GUI	Browser	Visual monitoring and management	None (use CLI if GUI down)	Check WCC service status, HTTP response

⚙ Quick Reference

4 commands from this guide

File	Command / Code	Purpose
io/thecodeforge/autosys/start_event_processor.sh	eventor	The Event Processor
io/thecodeforge/autosys/autosys_client_commands.sh	autostatus -J daily_report	Client tools
TimezoneDrift.yml	USER> autostatus -A \| grep -i scheduler	The Scheduler
DependencyStorm.yml	job: daily_payment_export	The Dependency Graph Without the Graph

Key takeaways

AutoSys has four core components: Event Server (database), Event Processor (scheduler), Remote Agents (job executors), and client tools.

The Event Server is the single source of truth: every job definition, event, and status lives there.

The Event Processor continuously evaluates job conditions and triggers agents; it never executes jobs directly.

Remote Agents run on target machines and execute the actual work, reporting results back to the Event Server.

PEND_MACH is one of the most common production issues and is caused by agent machines going offline or running out of disk space.

Common mistakes to avoid

5 patterns

Assuming the Event Processor runs the job itself — it doesn't

Symptom

Debugging a job that fails but Event Processor logs show nothing. The team looks in wrong place for job output.

Fix

Understand the flow: Event Processor triggers Remote Agent, which executes the job. Check agent logs on target machine, not Event Processor logs.

Running multiple Event Processor instances on the same AutoSys instance

Symptom

Duplicate job runs. Same job starts twice at same time. Event Server state becomes corrupted.

Fix

Ensure only one eventor process: ps -ef | grep eventor | grep -v grep | wc -l should be 1. Use a lock file in the startup script: flock -n /var/lock/eventor.lock -c eventor.

Not monitoring the Event Server database size

Symptom

Event history accumulates for years. Table sizes grow to 500GB+ (AE_QUEUE, AE_EVENTS). Query performance degrades, job scheduling slows down.

Fix

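Regularly purge old events, for example with db_purge_events -date '01/01/2026' (check the exact name and flags of the purge or archive utility on your release). Run analyze table on Event Server tables. Set a retention policy: keep events for 90 days and archive older events to a flat file.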
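A scheduled purge needs a rolling cutoff date rather than a hard-coded one. The sketch below only computes that date; it does not call any AutoSys utility. It assumes GNU date (for `-d @epoch`) and uses the MM/DD/YYYY format shown above. The function name is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: compute the cutoff date for a retention window given in days.
# Requires GNU date. The output format follows the date shown in the text.

# Print the date that lies $1 days before "now" ($2, epoch seconds,
# defaults to the current time), evaluated in UTC.
retention_cutoff() {
    local days now
    days=$1
    now=${2:-$(date +%s)}
    date -u -d "@$(( now - days * 86400 ))" '+%m/%d/%Y'
}
```

`retention_cutoff 90` prints the date 90 days ago, ready to pass to whichever purge command your release provides. Working in epoch seconds and UTC avoids surprises around daylight saving changes.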

Forgetting that the agent user account needs the right permissions

Symptom

Job fails immediately with permission denied. Agent logs show 'cannot execute command as user x'.

Fix

Ensure the AutoSys agent user (typically autosys) has execute permission on job scripts and read/write access to log directories. Test by su - autosys -c '/path/to/script' before scheduling.
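The permission check can be made repeatable with a short preflight function, run in a shell owned by the agent user. This is a sketch with an illustrative function name; both paths are supplied by the caller and nothing in it is AutoSys-specific.

```shell
#!/usr/bin/env bash
# Sketch: preflight check to run as the agent user before scheduling a job.

# Succeed only if the script ($1) is an executable file and the log
# directory ($2) exists and is writable by the current user.
# Each problem found is reported on stderr.
preflight() {
    local script logdir rc
    script=$1
    logdir=$2
    rc=0
    if [ ! -f "$script" ] || [ ! -x "$script" ]; then
        echo "not executable: $script" >&2
        rc=1
    fi
    if [ ! -d "$logdir" ] || [ ! -w "$logdir" ]; then
        echo "not writable: $logdir" >&2
        rc=1
    fi
    return "$rc"
}
```

Because the tests run with the permissions of whoever calls the function, the result is only meaningful when it is invoked as the agent user, for example from a shell opened with `su - autosys`.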

Not handling PEND_MACH recovery after agent restart

Symptom

Agent restarts, but jobs remain stuck in PEND_MACH status. Team restarts agent, but jobs still don't run.

Fix

Do not assume PEND_MACH resolves on its own. After the agent restarts, check job status and force start any job that is still stuck: for job in $(autorep -J % -d | grep PEND_MACH | awk '{print $1}'); do sendevent -E FORCE_STARTJOB -J "$job"; done.

INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 · JUNIOR

What are the four main components of AutoSys architecture?

Q02 · SENIOR

What does the Event Processor do and how does it interact with the Event...

Q03 · SENIOR

What happens to jobs when a Remote Agent machine goes down?

Q04 · JUNIOR

What is PEND_MACH status and what causes it?

Q05 · SENIOR

Can you run multiple Event Processors for the same AutoSys instance?

Q01 of 05 · JUNIOR

What are the four main components of AutoSys architecture?

ANSWER

Event Server (database storing job definitions and events), Event Processor (scheduler daemon that evaluates conditions and triggers jobs), Remote Agent (executes jobs on target machines), Client Tools (jil, autorep, sendevent, WCC UI for user interaction). The Event Server is the source of truth. The Event Processor is the decision engine. Remote Agents are the workers. Client tools are the interfaces.
