Senior 6 min · March 19, 2026

AutoSys Architecture — Full /var Crashes Agents Silently

A full /var disk crashed the Remote Agent silently, leaving 500 jobs in PEND_MACH.

Naren Founder & Principal Engineer

20+ years shipping production infrastructure and CI/CD at scale. Drawn from code that ran under real load.

Production tested · Last updated May 23, 2026 · 1,554 articles, all by Naren
Quick Answer
  • AutoSys components: Event Server (database of all jobs/events), Event Processor (scheduler daemon), Remote Agent (job executor), client tools (jil, autorep, sendevent)
  • Event Server stores all job definitions, event history, machine definitions; source of truth for entire AutoSys environment
  • Event Processor runs continuously, evaluates conditions, triggers jobs on Remote Agents — only ONE per AutoSys instance
  • Performance: Event Processor polls Event Server (~30s interval), so allow for that delay in anything that needs real-time alerting
  • Production trap: Remote Agent machine runs out of disk space — agent crashes, all jobs on that machine go PEND_MACH (stuck), no auto-recovery
  • Biggest mistake: Running multiple Event Processors — corrupts job state, leads to duplicate job execution
Definition
What is AutoSys Architecture and Components?

AutoSys is a distributed job scheduling and workload automation platform from CA Technologies (now Broadcom), used by enterprises to orchestrate batch jobs, scripts, and workflows across thousands of servers. It solves the problem of coordinating time- and event-driven tasks at scale—think nightly ETL pipelines, report generation, or system maintenance—without manual intervention.


The architecture is fundamentally client-server, with a central Event Server (a database) storing all job definitions and state, an Event Processor (the 'brain') polling that database and dispatching commands, and Remote Agents running on target machines to execute the actual work. Client tools like autorep and sendevent let operators query and control jobs from the command line or GUI.

Where AutoSys shines is in environments requiring high reliability and auditability—banks, telecoms, and healthcare rely on it for SLA-bound processing. But its age shows: the architecture assumes reliable network and disk I/O, and a full /var filesystem on an agent host can silently kill the agent process without logging to the Event Server.

The agent's watchdog script may fail to restart if disk is 100% full, leaving jobs stuck in 'RUNNING' state indefinitely. Alternatives like Control-M, Airflow, or Prefect offer container-native scheduling or DAG-based workflows, but AutoSys remains entrenched in legacy mainframe-adjacent shops due to its mature event-driven model and integration with CA7.

You'll hit this /var failure mode when the agent's log directory ($AUTOUSER/log) or spool files fill the partition. The agent binary (wagent) writes heartbeat and job output to disk; when writes fail, it exits silently—no alert to the Event Processor.

The fix involves monitoring disk usage on agent hosts, separating /var from /opt/CA/WA_AGENT, and configuring ulimit or log rotation. Understanding this architectural fragility is key: the Event Server is the single source of truth, but it's blind to agent-side disk failures unless you add external monitoring.

Plain-English First

Think of AutoSys like a restaurant. The Event Server is the order book — it stores every job definition and event. The Event Processor is the head chef — it reads the orders and decides what to cook next. The Remote Agents are the kitchen staff on different floors — they actually execute the work. The GUI is the front-of-house — you see what's happening and can make changes.

Before you write a single line of JIL or schedule your first job, it helps to understand how AutoSys actually works under the hood. The architecture is straightforward but knowing what each component does — and why — will save you a lot of head-scratching when things go wrong in production.

AutoSys has four major components that work together: the Event Server, the Event Processor, Remote Agents, and client tools. Each has a clear job, and understanding the flow between them makes debugging much easier.

By the end you'll know exactly how job definitions flow from JIL to Event Server to Event Processor to Remote Agent and back. You'll understand what PEND_MACH means and why it's the most common production issue. And you'll know the component that, when it fails, stops all job scheduling.

Why AutoSys Agents Fail Silently on Full /var

AutoSys is a distributed job scheduling system in which a central Event Processor (the 'scheduler') dispatches work to Remote Agents running on target machines. The core mechanic: the agent receives a job from the Event Processor, executes the command, and reports status back, writing its own logs to /var/log/autosys along the way. When /var fills up, the agent can no longer write logs or status updates. It raises no error that reaches the scheduler; it simply stops reporting. Until the heartbeat times out, the scheduler assumes the agent is alive and keeps dispatching jobs to it. After the timeout, the machine shows as 'OFFLINE' or 'UNREACHABLE' in the GUI. This silent failure is the most common cause of 'lost' jobs in production.

In practice, agents use a heartbeat mechanism (default 60-second interval) to signal liveness. A full /var blocks the heartbeat log writes, so the Event Processor marks the agent as down after 3 consecutive missed heartbeats. The agent process does not always exit when this happens. It can stay running but unable to communicate, a zombie state in which the agent looks active on the host (ps shows it) while the scheduler sees it as dead. In other cases, as in the incident later in this article, the process crashes outright. Either way nothing alerts you, so monitoring /var usage is not optional; a threshold of 85% should trigger alerts.

Use this architecture when you need centralized control over thousands of jobs across heterogeneous servers. The silent failure mode matters because it breaks the fundamental contract of distributed scheduling: reliable status reporting. Without disk space monitoring, teams waste hours debugging phantom network issues.

Silent Zombie Agents
An agent with a full /var does not always die; it can go mute instead. When the process stays alive, process monitors (e.g., monit, systemd) see nothing wrong.
Production Insight
Real scenario: A data pipeline's nightly batch jobs stopped running. The scheduler showed agents as ONLINE, but no jobs completed. Root cause: /var filled by old AutoSys logs (default retention: 30 days). Symptom: agents recorded 'disk full' in their own logs, but the error never surfaced to the scheduler. Rule: Always mount /var/log/autosys on a separate partition with a 5GB minimum and set log rotation to 7 days.
Key Takeaway
AutoSys agents fail silently when /var is full — they stop reporting but stay alive.
Monitor /var usage at 85% and set separate partitions for agent logs.
Heartbeat timeouts (3 missed = agent down) are your only signal — don't rely on process checks.
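
The 85% threshold is easy to check from cron on each agent host. The following is a minimal sketch, assuming the POSIX `df -P` output layout; `usage_pct` and `check_usage` are illustrative helper names, not part of AutoSys.

```shell
#!/usr/bin/env bash
# Sketch: warn when a filesystem crosses a usage threshold.
# usage_pct and check_usage are hypothetical helper names.

# Read `df -P <path>` output on stdin and print the Capacity value as a bare number.
usage_pct() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# check_usage <pct> [threshold]: print a status word, return non-zero on breach.
check_usage() {
  local pct=$1 threshold=${2:-85}
  if [ "$pct" -ge "$threshold" ]; then
    echo "ALERT"
    return 1
  fi
  echo "OK"
}

# Example wiring on an agent host (cron):
#   check_usage "$(df -P /var | usage_pct)" 85 || logger -t autosys "/var is filling up"
```

Parsing is kept separate from the threshold decision so the same check can be pointed at /tmp or /opt without changes.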
[Figure: AutoSys Architecture, Component Interaction. Windows and UNIX agents (autosys_agent, port 7520) sit on either side of the central CA Workload Automation AE Server, which contains the Scheduler, the Event Server (database), the Web Server (WCC), and the Application Server. Client machines (jil, autorep, sendevent, SDK applications) connect bidirectionally to the Application Server.]
Caption: AutoSys Architecture: Windows/UNIX Agents · Scheduler · Event Server · Application Server (thecodeforge.io)
AutoSys Architecture Components

The Event Server — the source of truth

The Event Server is a relational database (typically Sybase or Oracle) that stores everything AutoSys needs to operate. This includes all job definitions (what to run, when, where, under which conditions), all events that have occurred (job started, job succeeded, job failed), global variable values, machine definitions, calendar definitions, and monitor and report definitions.

When a job finishes and reports its status, that status goes into the Event Server. When the Event Processor needs to know whether a dependent job's condition is met, it queries the Event Server. It's the single source of truth for the entire AutoSys environment.

High availability with dual event servers
AutoSys supports a primary/shadow Event Server configuration. If the primary goes down, the shadow takes over automatically — no manual intervention needed. This is critical for environments that can't afford job scheduling downtime.
Production Insight
The Event Server is the single point of failure for all AutoSys metadata. If it goes down, no job definitions can be read, no status updates can be written.
Dual Event Servers (primary/shadow) provide failover with no downtime. The switchover is automatic and driven by heartbeat, so no manual step is needed.
Rule: Monitor Event Server CPU, disk I/O, and table sizes. A 500GB Event Server table with no indexes will cause 30-second query delays, stalling all job scheduling.
Key Takeaway
Event Server database stores everything — job definitions, event history, machine definitions, calendars. It is the source of truth.
Dual Event Servers provide high availability. Always configure shadow server for critical environments.
Rule: Purge old event history regularly (db_purge_events) to keep query performance acceptable.
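
Table growth is easier to act on when it is checked automatically. Here is a small sketch, assuming you already have a listing of table names and row counts from your own database client; the `big_tables` helper and the input format are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: flag Event Server tables whose row count exceeds a limit.
# Input on stdin: one "TABLE_NAME NUM_ROWS" pair per line (assumed format).
# Header lines and anything without a numeric second column are skipped.
big_tables() {
  local limit=$1
  awk -v limit="$limit" '$2 ~ /^[0-9]+$/ && $2 + 0 > limit + 0 { print $1 }'
}

# Example: big_tables 50000000 < table_sizes.txt
```

Any table the function prints is a candidate for purging before query latency starts to stall scheduling.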

The Event Processor — the brain

The Event Processor (also called the scheduler or the event daemon) is the most important component. It runs continuously, polling the Event Server for events. When it detects that a job's starting conditions are met — the right time has arrived, dependent jobs have succeeded, the machine is available — it triggers the job to run on the appropriate agent.

The Event Processor also handles time-based scheduling, evaluates job condition logic, and manages the overall state machine for each job. On Unix/Linux it's started with the eventor command. There is only ever one Event Processor running per AutoSys instance.

io/thecodeforge/autosys/start_event_processor.sh (Bash)
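# Start the AutoSys Event Processor (UNIX only)
eventor

# Check that the Event Server and Event Processor are up
chk_auto_up

# Check connectivity to the agent on a specific machine
autoping -m prod-server-01

# Print AutoSys version, database, and host information
autoflags -a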
Production Insight
The Event Processor runs exactly once per AutoSys instance. A second instance causes duplicate job runs and state corruption.
The Event Processor polls the Event Server at configurable intervals (default 30 seconds). This means job conditions are not evaluated in real time.
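Rule: Never run two Event Processors. Use flock in startup scripts. Monitor the eventor process count from cron: if [ "$(pgrep -x eventor | wc -l)" -ne 1 ]; then alert; fi (where alert stands for your own notification command). Avoid a bare ps -ef | grep eventor | wc -l, because grep counts its own process.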
Key Takeaway
Event Processor is the decision engine — evaluates conditions, triggers jobs on Remote Agents.
Only ONE Event Processor per AutoSys instance. Duplicate eventors cause duplicate job execution.
Rule: Monitor Event Processor uptime. If it dies, no new jobs start. Running jobs continue to completion.
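
The one-and-only-one rule can be turned into a three-way check: down, healthy, or duplicated. This is a sketch, assuming the scheduler process is literally named eventor as in this article (verify the name with ps on your install); `count_exact` and `eventor_state` are hypothetical helpers.

```shell
#!/usr/bin/env bash
# Sketch: classify the Event Processor process count.
# Assumption: the process name is exactly "eventor".

# Count lines on stdin equal to the given command name,
# e.g. the output of `ps -eo comm=`.
count_exact() {
  awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# eventor_state <count>: map the count to a status word.
eventor_state() {
  case "$1" in
    0) echo "DOWN" ;;
    1) echo "OK" ;;
    *) echo "DUPLICATE" ;;
  esac
}

# Example wiring (cron):
#   eventor_state "$(ps -eo comm= | count_exact eventor)"
```

Matching the exact command name avoids the classic false positive where a grep for eventor counts itself or an editor that has an eventor log open.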

Remote Agents — where the work actually happens

Remote Agents run on every machine where AutoSys needs to execute jobs. When the Event Processor decides a job should run, it sends a message to the Remote Agent on the target machine. The agent starts the process, monitors it, captures the exit code, and reports the result back to the Event Server.

Agents can be extended with plugins for specific integrations — SAP, Oracle E-Business, PeopleSoft, and others. If an agent goes down, jobs that are supposed to run on that machine go into PEND_MACH status and wait until the agent comes back up.

PEND_MACH is one of the most common production issues
If a machine's filesystem fills up, the agent service crashes and all jobs on that machine go PEND_MACH. This is a very common production incident — always monitor disk space on agent machines.
Production Insight
The Remote Agent is a lightweight process, but it can crash silently. No alarm is raised by default.
When an agent goes down, jobs already running continue but cannot report completion, and new jobs cannot start.
Rule: Monitor agent heartbeat via Event Processor. If agent unreachable for >5 minutes, send alert. Monitor disk space, memory, and agent process existence on each agent machine.
Key Takeaway
Remote Agents execute jobs on target machines and report results back to Event Server.
PEND_MACH status means agent is unreachable. Jobs stay in PEND_MACH even after agent recovers; must force start.
Rule: Monitor agent disk space and process health. An agent crash is silent without external monitoring.
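
Forcing stuck jobs one by one is error-prone, so it helps to generate the commands and review them first. The sketch below is a dry run: it only prints the sendevent commands. It assumes each input line starts with the job name and carries the status as its own whitespace-delimited field, either PEND_MACH or the short code PE; check that against the autorep output of your release. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: turn a job status listing into FORCE_STARTJOB commands (dry run).
# Assumed input: "<job_name> ... <status> ..." with status PEND_MACH or PE.

# Print the name of every job whose line contains a PEND_MACH status field.
pend_mach_jobs() {
  awk '{ for (i = 2; i <= NF; i++) if ($i == "PEND_MACH" || $i == "PE") { print $1; break } }'
}

# Print, rather than run, one sendevent command per job name on stdin.
print_force_start() {
  while IFS= read -r job; do
    [ -n "$job" ] && printf 'sendevent -E FORCE_STARTJOB -J %s\n' "$job"
  done
  return 0
}

# Example: autorep -J % | pend_mach_jobs | print_force_start
```

Once the printed list looks right, piping it to sh runs the commands; keeping that as a separate step leaves a human in the loop.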

Client tools — how you interact with AutoSys

Client tools are the interfaces you use to define, manage, and monitor jobs. The main ones are: jil — the command-line JIL processor for creating and modifying job definitions; autorep — reports job status and definitions; sendevent — manually triggers events like starting a job or putting it on hold; autostatus — checks the current status of a specific job; and the WCC Web UI — a browser-based dashboard for monitoring job flows visually.

Most experienced AutoSys administrators work primarily from the command line using jil, autorep, and sendevent. The GUI is useful for monitoring and for people less comfortable with CLI.

io/thecodeforge/autosys/autosys_client_commands.sh (Bash)
# Check status of a specific job
autostatus -J daily_report

# Get a detailed report on a job
autorep -J daily_report -d

# List every job whose name starts with box_name (% is a wildcard)
autorep -J box_name%

# Check machine status
autorep -M prod-server-01
Production Insight
Client tools reach the Event Server through the Application Server shown in the diagram above, so check both when a client command hangs or fails.
jil and autorep are essential for scripting and automation. The GUI is convenient but adds no functionality.
Rule: Write scripts using autorep -J % -d for monitoring, sendevent -E FORCE_STARTJOB for recovery. Avoid manual GUI actions in automated recovery procedures.
Key Takeaway
jil defines jobs, autorep reports status, sendevent triggers manual events, WCC GUI provides visual monitoring.
CLI tools are scriptable; use them in automation. GUI is fine for ad-hoc monitoring.
Rule: For production support, master the CLI tools. The GUI is not available in an SSH session.

The Scheduler — Where Time Becomes a Trigger

You think a cron job is reliable? AutoSys Scheduler doesn't think so. It's the component that turns a wall-clock moment into a job start condition. But here's the rub: it's not a clock, it's a state machine.

The Scheduler runs inside the Event Processor. It doesn't fire jobs. It creates STARTING events. Those events get queued into the Event Server. If the Event Server can't accept the write — full disk, slow I/O, network partition — that STARTING event vaporizes. No retry. No log. Your job simply never runs.

Second trap: the Scheduler uses the system timezone of the machine hosting the Event Processor. If you migrate the Event Processor to a new host and forget to sync timezone, every scheduled job shifts by hours. The Event Server stores the UTC timestamp, but the comparison logic runs in local time. This is not daylight saving ignorance. This is production downtime.

Why this matters: when you see a job that should have fired at 02:00 but didn't, don't immediately blame the Agent. Check Scheduler health, check timezone, check if the Event Server was accepting writes at that second. Time-based triggers are only as reliable as the full write-path to the Event Server.

TimezoneDrift.yml (shell session)
# io.thecodeforge: devops tutorial

# Check what time AutoSys thinks it is
USER> autostatus -A | grep -i scheduler
SCHEDULER: Running on host 'prod-ep-01', PID 7723
   Current Local Time: 2025-03-18 14:22:35 EDT
   Event Processor TZ: America/New_York
   Event Server UTC:   2025-03-18 18:22:35

# After applying the UTC offset (EDT is UTC-4), the two clocks should agree to within 1 second
# Compare the host epoch time with the Event Server database clock
USER> date +%s; echo "---"; sqlplus -S autosys/autosys@EVENTDB <<< "SELECT TO_CHAR(SYSTIMESTAMP, 'YYYY-MM-DD HH24:MI:SS TZR') FROM DUAL;"
1742322155
---
2025-03-18 14:22:35 EDT

# 1742322155 is 2025-03-18 18:22:35 UTC, which is 14:22:35 EDT: off by 0 seconds, good.
# Off by 3600? You just lost an hour of schedules.
Output
Check local time and Event Server DB time. If mismatch >1 second, fix timezone config or NTP. Job missed? That's why.
Production Trap:
Never assume the Scheduler uses UTC internally. It doesn't. It uses the host OS timezone. When you change the Event Processor's timezone for any reason, every single scheduled job shifts. Always validate with autostatus -A before and after.
Key Takeaway
Jobs don't fire from time. They fire from a STARTING event written to the Event Server. No write, no run. Timezone is a binary kill-switch.
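
A timezone change shows up as a shift in the host's UTC offset, which can be recorded before and after a migration and compared. The sketch below assumes the numeric offset format printed by `date +%z` (for example -0400); `tz_offset_seconds` and `drift_seconds` are illustrative names.

```shell
#!/usr/bin/env bash
# Sketch: convert a numeric UTC offset such as -0400 into seconds,
# and measure the drift between two epoch timestamps.

# tz_offset_seconds <+HHMM|-HHMM>: print the offset in seconds.
tz_offset_seconds() {
  local off=$1 sign=1 hh mm
  case "$off" in
    -*) sign=-1 ;;
  esac
  off=${off#[+-]}
  hh=${off%??}
  mm=${off#??}
  # Drop one leading zero so values like 08 are not read as octal.
  hh=${hh#0}
  mm=${mm#0}
  echo $(( sign * (${hh:-0} * 3600 + ${mm:-0} * 60) ))
}

# drift_seconds <epoch_a> <epoch_b>: print the absolute difference.
drift_seconds() {
  local d=$(( $1 - $2 ))
  echo "${d#-}"
}

# Example: tz_offset_seconds "$(date +%z)" on the Event Processor host
```

If the offset printed on the new host differs from the old one, every time-based start condition will shift by exactly that difference.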

The Dependency Graph Without the Graph — Why Job A Doesn't Care About Job B

You set 'condition: s(jobB)' on job A. You think: when job B finishes, job A starts. Wrong. AutoSys doesn't maintain a real-time dependency tree. It evaluates a condition when the job is submitted and again only when a referenced job changes status, not continuously.

Here's how it actually works: job A is submitted to the Event Server with a condition. The Event Processor sees the condition, looks up the current status of job B from the Event Server. If job B is SUCCESS, the Event Processor creates a STARTING event for job A immediately. If job B is still RUNNING, the Event Processor does nothing. It does NOT poll. It does NOT watch. The condition sits dead until job B finishes and its status changes.

When job B finishes, job A doesn't automatically start. That status change triggers the Event Processor to re-evaluate ALL conditions that reference job B. This is the "status-change cascade." Every finished job forces a condition re-evaluation across every dependency. If you have 10,000 jobs depending on one master job, that master job finishing will spawn 10,000 condition checks in a single CPU-bound loop. On a busy Event Processor, this chokes the scheduler for minutes.

Why this hurts: we once had a 45-minute gap between a master job finishing and the first dependent job starting. The Event Processor was stuck in a condition re-evaluation loop. The master job's status was SUCCESS. The dependent jobs had their conditions met. But the Event Processor couldn't process the next STARTING event until it finished evaluating all 8,000 dependents. No parallelism. No queue priority. Just a single-threaded condition scan.

Design for this. Batch dependents. Use box jobs to group dependencies. Never put 8,000 jobs on the same condition.

DependencyStorm.jil (JIL)
/* io.thecodeforge: devops tutorial */
/* command, machine, and owner attributes are omitted for brevity */

/* Before: each of 8,000 jobs carries its own condition on master_report (bad) */
insert_job: daily_payment_export
job_type: CMD
condition: s(master_report)

/* After: only the box carries the condition */
insert_job: payment_pipeline
job_type: BOX
condition: s(master_report)

/* Jobs inside the box need no condition of their own: they become eligible when the box starts */
insert_job: payment_export_region_a
job_type: CMD
box_name: payment_pipeline

insert_job: payment_export_region_b
job_type: CMD
box_name: payment_pipeline

/* master_report succeeds -> box starts RUNNING -> jobs in the box start */
/* 10 conditions evaluated instead of 8,000 */
Output
Reducing dependent count from 8,000 to 10 dropped condition evaluation time from 45 minutes to 2 seconds. No code change — just structural grouping.
Senior Shortcut:
Use 'autosyslog -e' to see how long a condition re-evaluation takes. Run it right after the parent job finishes. If you see a gap between parent SUCCESS and child STARTING longer than 5 seconds, you have a dependency storm. Cut dependents or introduce box jobs.
Key Takeaway
AutoSys does not pre-validate conditions. It only checks when the parent's status changes. That check is single-threaded. Group dependencies into boxes to keep the evaluation loop fast.
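
Before restructuring, it helps to measure how many jobs actually hang off one parent. This is a rough sketch that counts condition lines in exported JIL text; it assumes conditions are written as s(job) or success(job) and that job names contain only letters, digits, and underscores. `count_dependents` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: count how many JIL condition lines reference a given parent job.
# Reads JIL text on stdin. Only s(parent) and success(parent) are matched;
# other condition types are ignored in this simplified version.
count_dependents() {
  local parent=$1
  grep -E '^[[:space:]]*condition:' \
    | grep -cE "(^|[^[:alnum:]_])(s|success)\\($parent\\)"
}

# Example: autorep -J % -q | count_dependents master_report
```

A count in the thousands for a single parent is the signal to regroup those jobs before the next status change triggers a mass re-evaluation.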
Production incident · Post-mortem · Severity: high

The Full Disk That Froze 500 Jobs

Symptom
AutoSys Web UI showed 500 jobs in PEND_MACH status. autorep -J % -d showed jobs waiting for machine 'prod-db-01'. Machine was still running, but AutoSys agent was not responding. No CPU spike, no memory pressure, no network issues. The only symptom was a full /var filesystem.
Assumption
The team assumed the machine was healthy because it was pingable and SSH worked. They didn't know that the Remote Agent writes log files to /var and crashes silently when disk space runs out. They also had no monitoring on agent disk usage.
Root cause
The Remote Agent writes logs to /var/log/autosys/agent.log by default. A misconfigured application job generated 50GB of debug output overnight, filling /var. When the disk reached 100%, the agent service tried to write to the log, failed, and crashed. The Event Processor sent a heartbeat check to the agent, got no response, and marked all jobs on that machine as PEND_MACH (pending machine). No new jobs could start on that machine. Jobs that were already running kept running, but with the agent down their completion could not be reported, so they stayed in RUNNING state until a human intervened.
Fix
1. Added disk space monitoring for all agent machines: alert when /var > 80% full.
2. Configured log rotation for agent logs: logrotate with compression and 7-day retention.
3. Set disk_check_interval in agent config to check free space before writing.
4. For the offending job, limited log output to 100MB and added log rotation in the script.
5. Added a cron job that restarts the AutoSys agent if it's down: if ! pgrep -x autosys_agent >/dev/null; then /etc/init.d/autosys_agent start; fi. (A plain ps -ef | grep -q 'autosys_agent' always succeeds because grep matches its own process.)
6. Documented PEND_MACH resolution steps: check disk space, restart agent, then sendevent -E FORCE_STARTJOB for stuck jobs.
Key lesson
  • Remote Agent crashes are silent. The agent stops without raising an alarm. Jobs go PEND_MACH without notification.
  • Always monitor disk space on agent machines. An 80% full alert is your early warning. At 95%, trigger immediate action.
  • PEND_MACH does not auto-resolve. Even if the agent restarts, jobs remain PEND_MACH until manually forced.
  • The Event Processor cannot distinguish between a crashed agent and a slow agent. It just waits for heartbeat timeout.
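
A restart watchdog is only as good as its liveness test. The sketch below checks a PID file with kill -0 instead of grepping ps output. It assumes the agent or its init script records its PID in a file; the path in the example is hypothetical, and `pid_alive` is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: liveness check that does not depend on grepping ps output.
# Assumption: the agent's PID is stored in a file passed in by the caller.
# Run as root or as the agent's own user: kill -0 also fails when the
# caller lacks permission to signal the process.
pid_alive() {
  local pidfile=$1 pid
  [ -r "$pidfile" ] || return 1
  pid=$(head -n 1 "$pidfile")
  # Reject empty or non-numeric content.
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # Signal 0 tests for existence without affecting the process.
  kill -0 "$pid" 2>/dev/null
}

# Example wiring (cron), with a hypothetical PID file path:
#   pid_alive /var/run/autosys_agent.pid || /etc/init.d/autosys_agent start
```

This only proves the process exists. A mute agent on a full disk still passes, so pair it with the disk space alert rather than using it alone.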
Production debug guide
Symptom → Action mapping for common AutoSys architecture failures. 5 entries.
Symptom · 01
All jobs on a specific machine stuck in PEND_MACH — no new jobs start
Fix
Remote Agent likely crashed. Check if agent process is running: ps -ef | grep autosys_agent. Check disk space on agent machine: df -h /var. Check agent log: /var/log/autosys/agent.log. Restart agent: /etc/init.d/autosys_agent restart. Then force restart stuck jobs: sendevent -E FORCE_STARTJOB -J job_name.
Symptom · 02
No new jobs start anywhere — all jobs stuck regardless of machine
Fix
Event Processor may be down. Check Event Processor and Event Server status: chk_auto_up. If down, restart: eventor. Also check Event Server database connectivity: isql -S autosys (Sybase). Check if Event Processor process exists: ps -ef | grep eventor.
Symptom · 03
Job status inconsistent — job shows SUCCESS but child didn't run
Fix
Events may be missing from Event Server. Check Event Processor log: $AUTOSYS/log/event_processor.log. Look for 'lost event' or 'queue overflow'. Increase Event Server buffer size or purge old events.
Symptom · 04
Duplicate jobs executed — same job runs twice at same time
Fix
Two Event Processor instances running. Check: pgrep -x eventor | wc -l should be 1. If >1, kill duplicate processes. Prevent by using flock on an eventor lock file.
Symptom · 05
Job status updates delayed by hours — finished job still shows RUNNING
Fix
Remote Agent network latency or Event Server overload. Check Event Processor polling interval (default 30 seconds). For high-volume environments, increase Event Processor threads: max_threads in configuration.
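
The 'lost event' and 'queue overflow' phrases from Symptom 03 are easy to scan for on a schedule. This is a minimal sketch that counts matching lines in whatever log text it is given; `count_event_loss` is a hypothetical helper, and the exact wording in your release's log may differ.

```shell
#!/usr/bin/env bash
# Sketch: count log lines that mention lost events or a queue overflow.
# Reads log text on stdin; matching is case-insensitive.
count_event_loss() {
  grep -ciE 'lost event|queue overflow'
  # grep exits non-zero when nothing matches; a zero count is not an error here.
  return 0
}

# Example: count_event_loss < "$AUTOSYS/log/event_processor.log"
```

A non-zero count after a job shows SUCCESS but its child never ran points at the event path rather than at the child job's definition.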
AutoSys Component Debug Cheat Sheet
Fast diagnostics for AutoSys architecture issues in production environments.
Jobs stuck in PEND_MACH — all jobs on one machine
Immediate action
Check Remote Agent status and disk space
Commands
ps -ef | grep -i autosys_agent
df -h /var /tmp /opt
Fix now
Restart agent: /etc/init.d/autosys_agent restart. Then force start jobs: for job in $(autorep -J % -m MACHINE_NAME -d | grep PEND_MACH | awk '{print $1}'); do sendevent -E FORCE_STARTJOB -J $job; done
No jobs starting anywhere: Event Processor likely down
Immediate action
Check Event Processor status
Commands
chk_auto_up
ps -ef | grep eventor
Fix now
Start Event Processor: eventor. Check Event Server connectivity: sqlplus autosys_user@autosys_db (Oracle) or isql -S autosys (Sybase).
Jobs stuck in RUNNING but log shows they completed
Immediate action
Check if Remote Agent can write back to Event Server
Commands
tail -100 /var/log/autosys/agent.log | grep -i error
telnet EVENT_SERVER_HOST PORT (read the Event Server port from the agent configuration)
Fix now
Restart agent. Check network connectivity between agent and Event Server. Firewall may have changed.
Job status inconsistent: duplicates, missing events
Immediate action
Check for duplicate Event Processors
Commands
ps -ef | grep eventor | grep -v grep | wc -l
autorep -J % -q | grep -i duplicate
Fix now
Kill duplicate eventor processes: pkill -f eventor (careful — kills all). Use lock file to prevent multiple instances in startup script.
High Event Server CPU: slow job scheduling
Immediate action
Check Event Server database for large tables or missing indexes
Commands
echo "SELECT table_name, num_rows FROM user_tables WHERE table_name LIKE 'AE%';" | sqlplus -S autosys_user@autosys_db
ls -lh $AUTOSYS/log/event_processor.log
Fix now
Purge old events: db_purge_events -date 'MM/DD/YYYY'. Run analyze table on Event Server tables. Shorten the event history retention period to reduce table size.
AutoSys Components Comparison
| Component | Type | Runs On | Key Responsibility | Failure Impact | How to Monitor |
| --- | --- | --- | --- | --- | --- |
| Event Server | Database (Sybase/Oracle) | Dedicated server | Stores all job definitions, events, state | Catastrophic: no job definitions can be read, no status updates | Check database connectivity, table sizes, I/O latency, dual server sync |
| Event Processor | Daemon/Service | AutoSys server | Evaluates conditions, triggers jobs | Severe: no new jobs start, running jobs continue | chk_auto_up, eventor process count, check log for errors |
| Remote Agent | Service | Every target machine | Executes jobs, reports results | Local: jobs on that machine go PEND_MACH, other machines unaffected | Check process exists, disk space, network connectivity to Event Server |
| jil | CLI client | Any client machine | Define/modify job definitions | None (if jil fails, use another client) | N/A (tool exits with non-zero code on error) |
| autorep | CLI client | Any client machine | Report job status and definitions | None (use another client) | N/A |
| sendevent | CLI client | Any client machine | Manually trigger events (START, HOLD, etc.) | None (use another client) | N/A |
| WCC (Web UI) | Browser GUI | Browser | Visual monitoring and management | None (use CLI if GUI down) | Check WCC service status, HTTP response |

Key takeaways

1. AutoSys has four core components: Event Server (database), Event Processor (scheduler), Remote Agents (job executors), and client tools.
2. The Event Server is the single source of truth: every job definition, event, and status lives there.
3. The Event Processor continuously evaluates job conditions and triggers agents; it never executes jobs directly.
4. Remote Agents run on target machines and execute the actual work, reporting results back to the Event Server.
5. PEND_MACH is one of the most common production issues and is caused by agent machines going offline or running out of disk space.

Common mistakes to avoid


Assuming the Event Processor runs the job itself — it doesn't

Symptom
Debugging a job that fails but Event Processor logs show nothing. The team looks in wrong place for job output.
Fix
Understand the flow: Event Processor triggers Remote Agent, which executes the job. Check agent logs on target machine, not Event Processor logs.

Running multiple Event Processor instances on the same AutoSys instance

Symptom
Duplicate job runs. Same job starts twice at same time. Event Server state becomes corrupted.
Fix
Ensure only one eventor process: ps -ef | grep eventor | grep -v grep | wc -l should be 1. Use a lock file in the startup script: flock -n /var/lock/eventor.lock -c eventor.

Not monitoring the Event Server database size

Symptom
Event history accumulates for years. Table sizes grow to 500GB+ (AE_QUEUE, AE_EVENTS). Query performance degrades, job scheduling slows down.
Fix
Regularly purge old events: db_purge_events -date '01/01/2026'. Run analyze table on Event Server tables. Set retention policy: keep events for 90 days, archive older events to flat file.

Forgetting that the agent user account needs the right permissions

Symptom
Job fails immediately with permission denied. Agent logs show 'cannot execute command as user x'.
Fix
Ensure the AutoSys agent user (typically autosys) has execute permission on job scripts and read/write access to log directories. Test by su - autosys -c '/path/to/script' before scheduling.

Not handling PEND_MACH recovery after agent restart

Symptom
Agent restarts, but jobs remain stuck in PEND_MACH status. Team restarts agent, but jobs still don't run.
Fix
PEND_MACH does NOT auto-resolve. After agent restart, force start jobs: for job in $(autorep -J % -d | grep PEND_MACH | awk '{print $1}'); do sendevent -E FORCE_STARTJOB -J "$job"; done. If the machine itself is still marked offline, bring it back first with sendevent -E MACH_ONLINE -N machine_name.
INTERVIEW PREP · PRACTICE MODE

Interview Questions on This Topic

Q01 (Junior): What are the four main components of AutoSys architecture?
Q02 (Senior): What does the Event Processor do and how does it interact with the Event...
Q03 (Senior): What happens to jobs when a Remote Agent machine goes down?
Q04 (Junior): What is PEND_MACH status and what causes it?
Q05 (Senior): Can you run multiple Event Processors for the same AutoSys instance?

Q01 of 05 (Junior): What are the four main components of AutoSys architecture?

Answer: Event Server (database storing job definitions and events), Event Processor (scheduler daemon that evaluates conditions and triggers jobs), Remote Agent (executes jobs on target machines), Client Tools (jil, autorep, sendevent, WCC UI for user interaction). The Event Server is the source of truth. The Event Processor is the decision engine. Remote Agents are the workers. Client tools are the interfaces.
FAQ · 5 QUESTIONS

Frequently Asked Questions

01. What database does AutoSys use for the Event Server?
02. What happens if the Event Processor crashes?
03. Can I run AutoSys jobs on Windows machines?
04. What is the difference between the Event Server and the Event Processor?
05. How do I recover from PEND_MACH status?
